A model-based evaluation of data quality activities in KDD

Cited by: 26
Authors
Mezzanzanica, Mario [1, 2]
Boselli, Roberto [1, 2]
Cesarini, Mirko [1, 2]
Mercorio, Fabio [2]
Affiliations
[1] Univ Milano Bicocca, Dept Stat & Quantitat Methods, I-20126 Milan, Italy
[2] Univ Milano Bicocca, CRISP Res Ctr, I-20126 Milan, Italy
Keywords
Data quality; Data cleansing; Model checking; Real-life application; CHECKING; KNOWLEDGE
DOI
10.1016/j.ipm.2014.07.007
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We live in the Information Age, where most personal, business, and administrative data are collected and managed electronically. However, poor data quality may affect the effectiveness of knowledge discovery processes, thus making the development of data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. The proposed approach has been realized through model checking techniques and then applied to a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach for data quality, as they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community. (C) 2014 Elsevier Ltd. All rights reserved.
Pages: 144-166
Page count: 23
Related papers
50 results in total
  • [21] Effects of wetland restoration on drinking water quality: Model-based evaluation with radon-222 and chloride data
    Zechner, Eric
    Huggenberger, Peter
    Wülser, Richard
    Geissbühler, Urs
    Wüthrich, Christoph
    IAHS-AISH Publication, 2002, (277): 431-438
  • [22] Model-based Performance Evaluation of Batch and Stream Applications for Big Data
    Kross, Johannes
    Krcmar, Helmut
    2017 IEEE 25TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS), 2017: 80-86
  • [23] Navigation Sensor Data Reliability Model-Based on Self-Evaluation and Mutual Evaluation
    Li, Wenqiang
    Zhang, Zhongxuan
    Liang, Yi
    Shen, Feng
    Gao, Wei
    Xu, Dingjie
    IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (23): 20735-20745
  • [24] A Tool to Support Model-Based Testing Activities
    Doi Junior, Gilson
    Bonifacio, Adilson Luiz
    2011 BRAZILIAN SYMPOSIUM ON COMPUTING SYSTEM ENGINEERING (SBESC), 2011: 21-26
  • [25] MODEL-BASED QUANTIFICATION OF IMAGE QUALITY
    HAZRA, R
    MILLER, KW
    PARK, SK
    VISUAL INFORMATION PROCESSING FOR TELEVISION AND TELEROBOTICS, 1989, 3053: 11-22
  • [26] Model-based quality assurance in radiology
    Haug, PJ
    Frederick, PR
    Christensen, L
    Haug, SJ
    Farney, M
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 1999: 1074
  • [27] A MODEL-BASED APPROACH FOR DETERMINING DATA QUALITY METRICS IN COMBUSTION PRESSURE MEASUREMENT
    Rogers, D. R.
    Mason, B. A.
    Pezouvanis, A.
    Ebrahimi, M. K.
    COMBUSTION SCIENCE AND TECHNOLOGY, 2015, 187 (04): 627-641
  • [28] Faithful Model Evaluation for Model-Based Metrics
    Goyal, Palash
    Hu, Qian
    Gupta, Rahul
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023: 7484-7489
  • [29] Model-based clustering of longitudinal data
    McNicholas, Paul D.
    Murphy, T. Brendan
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2010, 38 (01): 153-168
  • [30] Boosting for model-based data clustering
    Saffari, Amir
    Bischof, Horst
    PATTERN RECOGNITION, 2008, 5096: 51-60