From Theory to Practice: A Data Quality Framework for Classification Tasks

Cited by: 18
Authors
Camilo Corrales, David [1, 2]
Ledezma, Agapito [2]
Carlos Corrales, Juan [1]
Affiliations
[1] Univ Cauca, Grp Ingn Telemat, Campus Tulcan, Popayan 190002, Colombia
[2] Univ Carlos III Madrid, Dept Informat, Ave Univ 30, Leganes 28911, Spain
Source
SYMMETRY-BASEL | 2018 / Vol. 10 / Issue 07
Keywords
DQF4CT; data quality issue; classification task; conceptual framework; data cleaning ontology; FEATURE-SELECTION; VERTEBRAL COLUMN; ONTOLOGIES; KNOWLEDGE; MODELS; PRINCIPLES; IMPUTATION; NOISE
DOI
10.3390/sym10070248
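Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code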
07 ; 0710 ; 09 ;
Abstract
Data preprocessing is an essential step in knowledge discovery projects. Experts estimate that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process, and several authors consider data cleaning one of the most cumbersome and critical of these tasks. Failure to ensure high data quality in the preprocessing stage significantly reduces the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that gives the user guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. We present two case studies with real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms that the authors of PAM and OD used in their classification tasks. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). Of the results achieved by the models trained on datasets cleaned by DQF4CT, 84% are better than those of the models from the datasets' authors.
Pages: 29