A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model

Cited by: 52
Authors
Judson, Richard [1]
Elloumi, Fathi [1]
Setzer, R. Woodrow [1]
Li, Zhen [2]
Shah, Imran [1]
Affiliations
[1] US EPA, Natl Ctr Computat Toxicol, Off Res & Dev, Res Triangle Pk, NC 27711 USA
[2] Univ N Carolina, Dept Biostat, Chapel Hill, NC 27599 USA
DOI
10.1186/1471-2105-9-241
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naive Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
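The evaluation setup summarized in the abstract (simulated assay data with causal and irrelevant features, measurement noise, filter-based feature selection, and K-way cross-validation) can be illustrated with a minimal sketch. This is not the authors' simulation model: the data generator, the t-like filter score, the fold assignment, and all function names (`simulate`, `filter_select`, `knn_predict`, `cross_validate`) are assumptions made for illustration, and only one of the six classifiers (KNN with k = 5) is shown, using the Python standard library.

```python
import random
import statistics

def simulate(n_samples, n_causal, n_irrelevant, noise_sd, rng):
    """Toy data (assumed model): the label depends only on the causal features."""
    X, y = [], []
    for _ in range(n_samples):
        causal = [rng.gauss(0.0, 1.0) for _ in range(n_causal)]
        label = 1 if sum(causal) > 0 else 0
        # Measurement noise is added after the label has been fixed.
        noisy = [v + rng.gauss(0.0, noise_sd) for v in causal]
        irrelevant = [rng.gauss(0.0, 1.0) for _ in range(n_irrelevant)]
        X.append(noisy + irrelevant)
        y.append(label)
    return X, y

def filter_select(X, y, n_keep):
    """Filter-based selection: keep the features with the largest t-like score."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        spread = (statistics.pvariance(pos) / len(pos)
                  + statistics.pvariance(neg) / len(neg)) ** 0.5
        diff = abs(statistics.fmean(pos) - statistics.fmean(neg))
        scores.append(diff / (spread + 1e-12))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )[:k]
    votes = sum(lab for _, lab in nearest)
    return 1 if 2 * votes > k else 0

def cross_validate(X, y, n_folds=5, n_keep=None, k=5):
    """K-fold accuracy; feature selection is refit on each training fold only."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        cols = filter_select(X_tr, y_tr, n_keep) if n_keep else list(range(len(X[0])))
        X_tr_sel = [[row[j] for j in cols] for row in X_tr]
        for i in test_idx:
            x = [X[i][j] for j in cols]
            correct += knn_predict(X_tr_sel, y_tr, x, k) == y[i]
    return correct / len(X)

rng = random.Random(0)
X, y = simulate(n_samples=200, n_causal=3, n_irrelevant=30, noise_sd=0.5, rng=rng)
acc_all = cross_validate(X, y)                 # all 33 features
acc_filtered = cross_validate(X, y, n_keep=3)  # top 3 features per fold
print(f"KNN accuracy, all features:      {acc_all:.3f}")
print(f"KNN accuracy, filtered features: {acc_filtered:.3f}")
```

Refitting the filter inside each training fold keeps the held-out samples from influencing which features are kept. Varying `n_irrelevant` and `noise_sd` in this toy setup gives a rough feel for the kind of comparison the abstract describes, without reproducing its results.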
Pages: 16