A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets

Cited by: 17

Authors:

Dabke, Kruttika [1,2]

Kreimer, Simion [3,4]

Jones, Michelle R. [1]

Parker, Sarah J. [3,4]

Affiliations:

[1] Cedars Sinai Med Ctr, Ctr Bioinformat & Funct Genom, Dept Biomed Sci, Los Angeles, CA 90048 USA

[2] Cedars Sinai Med Ctr, Dept Biomed Sci, Grad Program Biomed Sci, Los Angeles, CA 90048 USA

[3] Cedars Sinai Med Ctr, Adv Clin Biosyst Res Inst, Smidt Heart Inst, Dept Cardiol, Los Angeles, CA 90048 USA

[4] Cedars Sinai Med Ctr, Adv Clin Biosyst Res Inst, Smidt Heart Inst, Dept Biomed Sci, Los Angeles, CA 90048 USA

Source:

JOURNAL OF PROTEOME RESEARCH | 2021 / Volume 20 / Issue 06

Keywords:

DIA-MS; proteomics; missing values; imputation methods; GENE-EXPRESSION DATA; PROTEOGENOMIC CHARACTERIZATION; QUANTITATIVE PROTEOMICS; STATISTICAL-ANALYSIS; QUANTIFICATION; ALIGNMENT; PACKAGE;

DOI:

10.1021/acs.jproteome.1c00070

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level, the fragment level, improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
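As a rough illustration of what comparing imputation methods can look like in practice, the following Python sketch hides a random subset of known values in a synthetic matrix, imputes them with two simple baselines, and scores each by normalized RMSE. This is a generic mask-and-score setup, not the authors' workflow: the abstract describes evaluation on data sets with real missing values (including a dilution series), and the data, function names, and the two baseline methods here are all assumptions chosen for brevity.

```python
import numpy as np

def mask_values(data, fraction, rng):
    """Return a copy of `data` with a random fraction of entries set to NaN."""
    masked = data.copy()
    n_mask = int(round(fraction * data.size))
    idx = rng.choice(data.size, size=n_mask, replace=False)
    masked.flat[idx] = np.nan
    return masked

def impute_row_mean(data):
    """Replace each NaN with the mean of the observed values in its row."""
    filled = data.copy()
    row_means = np.nanmean(data, axis=1)
    rows, cols = np.where(np.isnan(data))
    filled[rows, cols] = row_means[rows]
    return filled

def impute_global_min(data):
    """Replace each NaN with the smallest observed value in the matrix."""
    filled = data.copy()
    filled[np.isnan(data)] = np.nanmin(data)
    return filled

def nrmse(truth, imputed, mask):
    """RMSE over the masked entries, normalized by the std of their true values."""
    err = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(truth[mask]))

rng = np.random.default_rng(0)

# Synthetic log-intensity matrix (hypothetical): 200 features x 12 samples,
# each feature has its own abundance level plus small sample-to-sample noise.
feature_level = rng.normal(20.0, 2.0, size=(200, 1))
truth = feature_level + rng.normal(0.0, 0.3, size=(200, 12))

masked = mask_values(truth, 0.10, rng)
mask = np.isnan(masked)

methods = {"row_mean": impute_row_mean, "global_min": impute_global_min}
scores = {name: nrmse(truth, fn(masked), mask) for name, fn in methods.items()}

for name, score in scores.items():
    print(f"{name}: NRMSE = {score:.3f}")
```

A lower NRMSE means the imputed values sit closer to the hidden true values. Because the mask here is drawn completely at random, the sketch says nothing about abundance-dependent missingness, which is one reason an evaluation on real missing values, as described in the abstract, may rank methods differently.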

Citation:

Pages: 3214 - 3229

Page count: 16


