PCA and PLS with very large data sets

Cited by: 139
Authors
Kettaneh, N
Berglund, A
Wold, S [1]
Affiliations
[1] Umea Univ, Res Grp Chemometr, S-90187 Umea, Sweden
[2] Umetrics Inc, Kinnelon, NJ 07405 USA
Keywords
principal components; PLS; data mining; data preprocessing; clustering
DOI
10.1016/j.csda.2003.11.027
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Chemometrics was started around 30 years ago to cope with the rapidly increasing volumes of data produced in chemical laboratories. A multivariate approach based on projections (PCA and PLS) was developed that adequately solved many of the problems at hand. However, with the further increase in the size of our data sets seen today in all fields of science and technology, we start to see inadequacies in our multivariate methods, both in their efficiency and interpretability. Starting from a few examples of complicated problems seen in RD&P (research, development, and production), possible extensions and generalizations of the existing multivariate projection methods (PCA and PLS) will be discussed. Criteria such as scalability of methods to increasing size of problems and data, increasing sophistication in the handling of noise and non-linearities, interpretability of results, and relative simplicity of use, will be held as important. The discussion will be made from a perspective of the evolution of scientific methodology as (a) driven by new technology, e.g., computers and graphical displays, and the need to answer some always reoccurring and basic questions, and (b) constrained by the limitations of the human brain, i.e., our ability to understand and interpret scientific and data analytic results. © 2003 Elsevier B.V. All rights reserved.
Pages: 69-85
Number of pages: 17