From Theory to Practice: A Data Quality Framework for Classification Tasks

Cited by: 18
Authors
Camilo Corrales, David [1, 2]
Ledezma, Agapito [2]
Carlos Corrales, Juan [1]
Affiliations
[1] Univ Cauca, Grp Ingn Telemat, Campus Tulcan, Popayan 190002, Colombia
[2] Univ Carlos III Madrid, Dept Informat, Ave Univ 30, Leganes 28911, Spain
Source
SYMMETRY-BASEL | 2018 / Vol. 10 / Issue 07
Keywords
DQF4CT; data quality issue; classification task; conceptual framework; data cleaning ontology; FEATURE-SELECTION; VERTEBRAL COLUMN; ONTOLOGIES; KNOWLEDGE; MODELS; PRINCIPLES; IMPUTATION; NOISE;
DOI
10.3390/sym10070248
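Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract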
Data preprocessing is an essential step in knowledge discovery projects. Experts estimate that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. Several authors consider data cleaning one of the most cumbersome and critical of these tasks. Failure to ensure high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that guides the user on how to deal with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests suitable data cleaning approaches. We present two case studies on real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms that the authors of PAM and OD used in their classification tasks. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). Of the results achieved by models trained on the datasets cleaned by DQF4CT, 84% are better than those of the models of the datasets' authors.
Pages: 29