An empirical analysis of data preprocessing for machine learning-based software cost estimation

Cited by: 120
Authors
Huang, Jianglin [1]
Li, Yan-Fu [2]
Xie, Min [1]
Affiliations
[1] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[2] CentraleSupelec, Dept Ind Engn, Paris, France
Keywords
Software cost estimation; Data preprocessing; Missing-data treatments; Scaling; Feature selection; Case selection; SUPPORT VECTOR REGRESSION; MISSING DATA; MUTUAL INFORMATION; FEATURE-SELECTION; PREDICTION; MODELS; IMPUTATION; WEIGHTS; SIZE
DOI
10.1016/j.infsof.2015.07.004
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Context: Due to the complex nature of the software development process, traditional parametric models and statistical methods often appear inadequate to model the increasingly complicated relationship between project development cost and project features (or cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Many researchers regard data preprocessing as a fundamental stage of ML methods; however, very few works have focused on the effects of data preprocessing techniques.

Objective: This study aims to empirically assess the effectiveness of data preprocessing techniques on ML methods in the context of software cost estimation.

Method: We first conduct a literature survey of recent publications that use data preprocessing techniques, followed by a systematic empirical study that analyzes the strengths and weaknesses of individual data preprocessing techniques as well as their combinations.

Results: Our results indicate that data preprocessing techniques may significantly influence the final prediction, and that they can sometimes have a negative impact on the prediction performance of ML methods.

Conclusion: To reduce prediction errors and improve efficiency, preprocessing techniques must be selected carefully according to the characteristics of the ML methods and of the datasets used for software cost estimation.

(C) 2015 Elsevier B.V. All rights reserved.
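
To make the idea of comparing a learner with and without preprocessing concrete, here is a minimal sketch in plain Python. It is not the study's experimental setup: the five projects, the two features (size in KLOC, team experience), the effort values, and all function names are hypothetical, and a k-nearest-neighbour estimator stands in for the ML methods only because it is short to write. It applies one missing-data treatment (mean imputation) and one scaling technique (min-max), then reports leave-one-out error on raw and scaled features.

```python
from math import sqrt


def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp else 0.0
             for v, lo, sp in zip(row, lows, spans)]
            for row in rows]


def knn_predict(train_x, train_y, query, k=2):
    """Average the effort of the k nearest projects (Euclidean distance)."""
    ranked = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loo_mae(xs, ys, k=2):
    """Leave-one-out mean absolute error of the k-NN estimator."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        errors.append(abs(knn_predict(train_x, train_y, xs[i], k) - ys[i]))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical projects: [size in KLOC, team experience 1-5]; None = missing.
    projects = [[10.0, 3.0], [25.0, None], [40.0, 2.0], [12.0, 5.0], [60.0, 4.0]]
    efforts = [24.0, 70.0, 130.0, 20.0, 150.0]  # hypothetical person-months

    imputed = mean_impute(projects)
    # For brevity the scaler is fit on all rows; a stricter evaluation would
    # fit it on each training fold only.
    scaled = min_max_scale(imputed)

    print("LOO MAE, imputed only:      ", loo_mae(imputed, efforts))
    print("LOO MAE, imputed and scaled:", loo_mae(scaled, efforts))
```

Whether scaling helps or hurts here depends entirely on the made-up data, which is the point of such a comparison: the effect of a preprocessing step has to be measured for a given learner and dataset rather than assumed.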
Pages: 108-127
Page count: 20
Related papers
50 in total
  • [21] Editorial: Machine Learning-Based Methods for RNA Data Analysis
    Peng, Lihong
    Yang, Jialiang
    Wang, Minxian
    Zhou, Liqian
    FRONTIERS IN GENETICS, 2022, 13
  • [22] Machine Learning-Based Imputation Approach with Dynamic Feature Extraction for Wireless RAN Performance Data Preprocessing
    Dahj, Jean Nestor M.
    Ogudo, Kingsley A. A.
    SYMMETRY-BASEL, 2023, 15 (06)
  • [23] Machine Learning-Based Cost-Effective Smart Home Data Analysis and Forecasting for Energy Saving
    Park, Sanguk
    BUILDINGS, 2023, 13 (09)
  • [24] Machine learning-based blood pressure estimation using impedance cardiography data
    Bothe, T. L.
    Patzak, A.
    Opatz, O. S.
    Heinz, V.
    Pilz, N.
    ACTA PHYSIOLOGICA, 2025, 241 (02)
  • [25] Software-defined Software: A Perspective of Machine Learning-based Software Production
    Lee, Rubao
    Wang, Hao
    Zhang, Xiaodong
    2018 IEEE 38TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2018, : 1270 - 1275
  • [26] EMPIRICAL COMPARISON AND ANALYSIS OF MACHINE LEARNING-BASED APPROACHES FOR DRUGGABLE PROTEIN IDENTIFICATION
    Shoombuatong, Watshara
    Schaduangrat, Nalini
    Nikom, Jaru
    EXCLI JOURNAL, 2023, 22 : 915 - 927
  • [27] Comparative analysis of multi-source data for machine learning-based LAI estimation in Argania spinosa
    Mouafik, Mohamed
    Fouad, Mounir
    Audet, Felix Antoine
    El Aboudi, Ahmed
    ADVANCES IN SPACE RESEARCH, 2024, 73 (10) : 4976 - 4987
  • [28] Predicting the Accuracy of Machine Learning Algorithms for Software Cost Estimation
    Pareta, Chetana
    Yaadav, N. S.
    Kumar, Ajay
    Sharma, Arvind Kumar
    EMERGING TRENDS IN EXPERT APPLICATIONS AND SECURITY, 2019, 841 : 605 - 615
  • [29] Transparent Data Preprocessing for Machine Learning
    Strasser, Sebastian
    Klettke, Meike
    WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024, 2024
  • [30] Systematic Review of Machine Learning-Based Open-Source Software Maintenance Effort Estimation
    Miloudi C.
    Cheikhi L.
    Abran A.
    Recent Advances in Computer Science and Communications, 2023, 16 (03)