Similarity-driven sampling for data mining

Cited by: 0
Author
Reinartz, T [1]
Affiliation
[1] Daimler Benz AG, Res & Technol, D-89013 Ulm, Germany
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Industrial databases often contain millions of tuples but most data mining algorithms suffer from limited applicability to only small sets of examples. In this paper, we propose to utilize data reduction before data mining to overcome this deficit. We specifically present a novel similarity-driven sampling approach which applies two preparation steps, sorting and stratification, and reuses an improved variant of leader clustering. We experimentally evaluate similarity-driven sampling in comparison to statistical sampling techniques in different classification domains using C4.5 and instance-based learning as data mining algorithms. Experimental results show that similarity-driven sampling often outperforms statistical sampling techniques in terms of error rates using smaller samples.
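As a rough illustration of the clustering step named in the abstract, the sketch below implements plain textbook leader clustering and returns the cluster leaders as the reduced sample. It is not the improved variant the paper proposes, and it omits the sorting and stratification steps, whose details the abstract does not give. The function name `leader_sample`, the Euclidean distance, and the `threshold` parameter are assumptions made for this example.

```python
import math


def leader_sample(rows, threshold):
    """Return one representative (leader) per cluster, in first-seen order.

    Textbook leader clustering, used here as a sampling heuristic.
    Assumption: rows are numeric tuples compared by Euclidean distance.
    """
    leaders = []
    for row in rows:
        # A row is absorbed by any existing leader within `threshold`;
        # if no leader is close enough, the row becomes a new leader.
        if not any(math.dist(row, leader) <= threshold for leader in leaders):
            leaders.append(row)
    return leaders


data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (0.0, 0.2)]
sample = leader_sample(data, threshold=1.0)
print(sample)  # [(0.0, 0.0), (5.0, 5.0)]
```

A single pass suffices because each row is compared only against the leaders found so far, so the result depends on input order. That order dependence is one plausible reason a sorting step would precede clustering, though the abstract does not say so.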
Pages: 423-431
Page count: 9