A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model

Cited by: 52
Authors
Judson, Richard [1 ]
Elloumi, Fathi [1 ]
Setzer, R. Woodrow [1 ]
Li, Zhen [2 ]
Shah, Imran [1 ]
Affiliations
[1] US EPA, Natl Ctr Computat Toxicol, Off Res & Dev, Res Triangle Pk, NC 27711 USA
[2] Univ N Carolina, Dept Biostat, Chapel Hill, NC 27599 USA
DOI
10.1186/1471-2105-9-241
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naive Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
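The evaluation setup described in the abstract (simulated data with causal and irrelevant features, filter-based feature selection, and K-way cross-validation) can be illustrated with a small sketch. This is not the authors' simulation model or code. It is a toy example in pure standard-library Python, and every name, parameter value, and the t-like filter score are assumptions chosen for illustration. Only KNN with k = 5 is shown, because it is simple to write without external libraries.

```python
import random
import statistics

def simulate(n_samples, n_causal, n_irrelevant, noise, rng):
    """Toy data (hypothetical): the label depends only on the first n_causal features."""
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        # Causal features shift with the label; irrelevant ones are pure noise.
        causal = [label + rng.gauss(0, noise) for _ in range(n_causal)]
        irrelevant = [rng.gauss(0, 1) for _ in range(n_irrelevant)]
        X.append(causal + irrelevant)
        y.append(label)
    return X, y

def filter_select(X, y, n_keep):
    """Filter-based selection: rank features by a t-like score and keep the top n_keep."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9
        scores.append((abs(statistics.mean(a) - statistics.mean(b)) / spread, j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:n_keep])

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((p - q) ** 2 for p, q in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )
    votes = sum(lab for _, lab in dists[:k])
    return 1 if 2 * votes > k else 0

def cross_validate(X, y, n_folds=5, n_keep=None):
    """K-way cross-validation; feature selection is refit on each training fold only."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train = [i for i in range(len(X)) if i not in test_idx]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        cols = filter_select(X_tr, y_tr, n_keep) if n_keep else list(range(len(X[0])))
        X_tr = [[row[j] for j in cols] for row in X_tr]
        for i in test_idx:
            x = [X[i][j] for j in cols]
            correct += knn_predict(X_tr, y_tr, x) == y[i]
    return correct / len(X)

rng = random.Random(0)  # fixed seed so the run is repeatable
X, y = simulate(n_samples=200, n_causal=3, n_irrelevant=60, noise=1.0, rng=rng)
acc_all = cross_validate(X, y)
acc_filtered = cross_validate(X, y, n_keep=3)
print(f"KNN (k=5), all features:       {acc_all:.2f}")
print(f"KNN (k=5), top 3 by filter:    {acc_filtered:.2f}")
```

Refitting the filter inside each training fold keeps the held-out samples from influencing which features are chosen, which would otherwise inflate the cross-validated accuracy. Varying `n_irrelevant` and `noise` gives a rough feel for the kind of degradation the abstract reports, though the numbers from this toy setup are not comparable to the paper's results.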
Pages: 16
Related papers
50 in total
  • [1] A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model
    Richard Judson
    Fathi Elloumi
    R Woodrow Setzer
    Zhen Li
    Imran Shah
    BMC Bioinformatics, 9
  • [2] Multi-Scale Vehicle Classification Using Different Machine Learning Models
    Roxas, Edison A.
    Vicerra, Ryan Rhay P.
    Lim, Laurence A. Gan
    Dela Cruz, Jennifer C.
    Naguib, Raouf
    Dadios, Elmer P.
    Bandala, Argel A.
    2018 IEEE 10TH INTERNATIONAL CONFERENCE ON HUMANOID, NANOTECHNOLOGY, INFORMATION TECHNOLOGY, COMMUNICATION AND CONTROL, ENVIRONMENT AND MANAGEMENT (HNICEM), 2018,
  • [3] Comparison of Machine Learning Algorithms in Data classification
    ul Hassan, Ch Anwar
    Khan, Muhammad Sufyan
    Shah, Munam Ali
    2018 24TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATION AND COMPUTING (ICAC' 18), 2018, : 270 - 275
  • [4] MULTI-SCALE MACHINE LEARNING FOR THE CLASSIFICATION OF BUILDING PROPERTY VALUES
    Helber, Patrick
    Bischke, Benjamin
    Guo, Qiushi
    Hees, Joern
    Dengel, Andreas
    2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4873 - 4876
  • [5] Classification of Logging Data Using Machine Learning Algorithms
    Mukhamediev, Ravil
    Kuchin, Yan
    Yunicheva, Nadiya
    Kalpeyeva, Zhuldyz
    Muhamedijeva, Elena
    Gopejenko, Viktors
    Rystygulov, Panabek
    APPLIED SCIENCES-BASEL, 2024, 14 (17):
  • [6] Comparison of machine learning algorithms for classification of Big Data sets
    Singh, Barkha
    Indu, Sreedevi
    Majumdar, Sudipta
    THEORETICAL COMPUTER SCIENCE, 2025, 1024
  • [7] A novel multi-scale loss function for classification problems in machine learning
    Berlyand, Leonid
    Creese, Robert
    Jabin, Pierre-Emmanuel
    JOURNAL OF COMPUTATIONAL PHYSICS, 2024, 498
  • [8] Comparison of Machine Learning Algorithms and Fruit Classification using Orange Data Mining Tool
    Vaishnav, Devashree
    Rao, B. Rama
    PROCEEDINGS OF THE 2018 3RD INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES (ICICT 2018), 2018, : 603 - 607
  • [9] Hyperspectral Image Denoising and Classification Using Multi-Scale Weighted EMAPs and Extreme Learning Machine
    Liu, Meizhuang
    Cao, Faxian
    Yang, Zhijing
    Hong, Xiaobin
    Huang, Yuezhen
    ELECTRONICS, 2020, 9 (12) : 1 - 17
  • [10] Interpretable multi-morphology and multi-scale microalgae classification based on machine learning
    Yan, Huchao
    Peng, Xinggan
    Wang, Chao
    Xia, Ao
    Huang, Yun
    Zhu, Xianqing
    Zhang, Jingmiao
    Zhu, Xun
    Liao, Qiang
    ALGAL RESEARCH-BIOMASS BIOFUELS AND BIOPRODUCTS, 2024, 84