A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets

Cited by: 17
Authors
Dabke, Kruttika [1 ,2 ]
Kreimer, Simion [3 ,4 ]
Jones, Michelle R. [1 ]
Parker, Sarah J. [3 ,4 ]
Affiliations
[1] Cedars Sinai Med Ctr, Ctr Bioinformat & Funct Genom, Dept Biomed Sci, Los Angeles, CA 90048 USA
[2] Cedars Sinai Med Ctr, Dept Biomed Sci, Grad Program Biomed Sci, Los Angeles, CA 90048 USA
[3] Cedars Sinai Med Ctr, Adv Clin Biosyst Res Inst, Smidt Heart Inst, Dept Cardiol, Los Angeles, CA 90048 USA
[4] Cedars Sinai Med Ctr, Adv Clin Biosyst Res Inst, Smidt Heart Inst, Dept Biomed Sci, Los Angeles, CA 90048 USA
Keywords
DIA-MS; proteomics; missing values; imputation methods; GENE-EXPRESSION DATA; PROTEOGENOMIC CHARACTERIZATION; QUANTITATIVE PROTEOMICS; STATISTICAL-ANALYSIS; QUANTIFICATION; ALIGNMENT; PACKAGE
DOI
10.1021/acs.jproteome.1c00070
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods on two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
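
To illustrate the general idea of comparing imputation methods, here is a minimal sketch of a generic mask-and-impute check. It is not the authors' workflow: the abstract describes evaluation against dilution series and pilot data sets, whereas this sketch simply hides some observed values and scores how well each method recovers them. The toy intensity matrix, the two simple imputers (row mean and global minimum), and all function names are assumptions made for illustration. The methods named in the abstract (LLS, RF, BPCA) are not implemented here.

```python
import math
import random


def mask_observed(matrix, fraction, seed=0):
    """Hide a random fraction of observed entries; return the masked copy and hidden positions."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(len(observed) * fraction))
    hidden = rng.sample(observed, n_hide)
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_row_mean(matrix):
    """Fill each missing value with the mean of the observed values in its row."""
    all_obs = [v for row in matrix for v in row if v is not None]
    global_mean = sum(all_obs) / len(all_obs)
    out = []
    for row in matrix:
        obs = [v for v in row if v is not None]
        fill = sum(obs) / len(obs) if obs else global_mean  # fallback for empty rows
        out.append([fill if v is None else v for v in row])
    return out


def impute_global_min(matrix):
    """Fill every missing value with the lowest observed value in the matrix."""
    lowest = min(v for row in matrix for v in row if v is not None)
    return [[lowest if v is None else v for v in row] for row in matrix]


def nrmse(truth, imputed, positions):
    """RMSE over the hidden positions, divided by the standard deviation of the true values."""
    true_vals = [truth[i][j] for i, j in positions]
    rmse = math.sqrt(sum((truth[i][j] - imputed[i][j]) ** 2
                         for i, j in positions) / len(positions))
    mean = sum(true_vals) / len(true_vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in true_vals) / len(true_vals))
    return rmse / sd if sd > 0 else rmse


# Hypothetical log2 intensities: rows are fragments, columns are samples,
# None marks a real missing value that cannot be scored.
truth = [
    [20.1, 20.3, None, 20.2, 20.0],
    [15.2, None, 15.0, 15.1, 15.3],
    [25.0, 25.2, 25.1, None, 24.9],
    [18.0, 18.1, 17.9, 18.2, None],
]

masked, hidden = mask_observed(truth, fraction=0.25)
for name, method in [("row mean", impute_row_mean), ("global min", impute_global_min)]:
    print(name, round(nrmse(truth, method(masked), hidden), 3))
```

A lower score means the hidden values were recovered more closely. Which method scores best depends on the structure of the data, consistent with the abstract's conclusion that no single method suits every data set.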
Pages: 3214-3229
Page count: 16
Related papers
50 in total
  • [31] Imputation of missing values in DNA microarray gene expression data
    Kim, H
    Golub, GH
    Park, H
    2004 IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE, PROCEEDINGS, 2004, : 572 - 573
  • [33] A bootstrap method for using imputation techniques for data with missing values
    BELLO, AL
    BIOMETRICAL JOURNAL, 1994, 36 (04) : 453 - 464
  • [34] Missing values imputation in ocean buoy time series data
    Chakraborty, Samarpan
    Ide, Kayo
    Balachandran, Balakumar
    OCEAN ENGINEERING, 2025, 318
  • [35] Impact of imputation of missing values on classification error for discrete data
    Farhangfar, Alireza
    Kurgan, Lukasz
    Dy, Jennifer
    PATTERN RECOGNITION, 2008, 41 (12) : 3692 - 3705
  • [36] Imputation of Missing Values in the Fundamental Data: Using MICE Framework
    Meghanadh, Balasubramaniam
    Aravalath, Lagesh
    Joshi, Bhupesh
    Sathiamoorthy, Raghunathan
    Kumar, Manish
    JOURNAL OF QUANTITATIVE ECONOMICS, 2019, 17 (03) : 459 - 475
  • [37] Imputation of missing values for electronic health record laboratory data
    Li, Jiang
    Yan, Xiaowei S.
    Chaudhary, Durgesh
    Avula, Venkatesh
    Mudiganti, Satish
    Husby, Hannah
    Shahjouei, Shima
    Afshar, Ardavan
    Stewart, Walter F.
    Yeasin, Mohammed
    Zand, Ramin
    Abedi, Vida
    NPJ DIGITAL MEDICINE, 2021, 4 (01)
  • [38] The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination
    McGurk, Kathryn A.
    Dagliati, Arianna
    Chiasserini, Davide
    Lee, Dave
    Plant, Darren
    Baricevic-Jones, Ivona
    Kelsall, Janet
    Eineman, Rachael
    Reed, Rachel
    Geary, Bethany
    Unwin, Richard
    Nicolaou, Anna
    Keavney, Bernard D.
    Barton, Anne
    Whetton, Anthony D.
    Geifman, Nophar
    BIOINFORMATICS, 2020, 36 (07) : 2217 - 2223
  • [40] Semi-parametric optimization for missing data imputation
    Qin, Yongsong
    Zhang, Shichao
    Zhu, Xiaofeng
    Zhang, Jilian
    Zhang, Chengqi
    APPLIED INTELLIGENCE, 2007, 27 (01) : 79 - 88