PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods

Cited by: 14
Authors
Romano, Joseph D. [1, 2]
Le, Trang T. [1]
La Cava, William [1]
Gregg, John T. [1]
Goldberg, Daniel J. [3]
Chakraborty, Praneel [4, 5]
Ray, Natasha L. [6]
Himmelstein, Daniel [7, 8]
Fu, Weixuan [1]
Moore, Jason H. [1]
Affiliations
[1] Univ Penn, Inst Biomed Informat, Philadelphia, PA 19104 USA
[2] Univ Penn, Ctr Excellence Environm Toxicol, Philadelphia, PA 19104 USA
[3] Washington Univ, Dept Comp Sci & Engn, St Louis, MO 63130 USA
[4] Univ Penn, Sch Arts & Sci, Philadelphia, PA 19104 USA
[5] Univ Penn, Wharton Sch, Philadelphia, PA 19104 USA
[6] Princeton Day Sch, Princeton, NJ 08540 USA
[7] Related Sci, Denver, CO 80220 USA
[8] Univ Penn, Dept Syst Pharmacol & Translat Therapeut, Philadelphia, PA 19104 USA
Funding
National Institutes of Health (USA)
DOI
10.1093/bioinformatics/btab727
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. Results: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community.
Pages: 878 - 880
Number of pages: 3
Related papers
50 items in total
  • [31] Machine Learning for Perovskite Solar Cells: An Open-Source Pipeline
    Roberts, Nicholas
    Jones, Dylan
    Schuy, Alex
    Hsu, Shi-Chieh
    Lin, Lih Y.
    ADVANCED PHYSICS RESEARCH, 2024, 3 (11)
  • [32] AGATHA: Face Benchmarking Dataset for Exploring Criminal Surveillance methods on Open Source Data
    Brito, Paulo
    Fontes, Joao Pedro
    Miquelina, Nuno
    Guevara, Miguel Angel
    2018 1ST INTERNATIONAL CONFERENCE ON GRAPHICS AND INTERACTION (ICGI 2018), 2018
  • [33] Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts
    Tang, Kecong
    Naseri, Ardalan
    Wei, Yuan
    Zhang, Shaojie
    Zhi, Degui
    GIGASCIENCE, 2022, 11
  • [34] CircuitNet: An Open-Source Dataset for Machine Learning in VLSI CAD Applications With Improved Domain-Specific Evaluation Metric and Learning Strategies
    Chai, Zhuomin
    Zhao, Yuxiang
    Liu, Wei
    Lin, Yibo
    Wang, Runsheng
    Huang, Ru
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2023, 42 (12) : 5034 - 5047
  • [35] MLAir (v1.0) - a tool to enable fast and flexible machine learning on air data time series
    Leufen, Lukas Hubert
    Kleinert, Felix
    Schultz, Martin G.
    GEOSCIENTIFIC MODEL DEVELOPMENT, 2021, 14 (03) : 1553 - 1574
  • [36] An open-source machine learning framework for global analyses of parton distributions
    Ball, Richard D.
    Carrazza, Stefano
    Cruz-Martinez, Juan
    Del Debbio, Luigi
    Forte, Stefano
    Giani, Tommaso
    Iranipour, Shayan
    Kassabov, Zahari
    Latorre, Jose I.
    Nocera, Emanuele R.
    Pearson, Rosalyn L.
    Rojo, Juan
    Stegeman, Roy
    Schwan, Christopher
    Ubiali, Maria
    Voisey, Cameron
    Wilson, Michael
    EUROPEAN PHYSICAL JOURNAL C, 2021, 81 (10)
  • [37] Picasso: An Open-Source Machine Learning Schema for Annotating Images in Hematology
    Dhillon, Vikram
    Balasubramanian, Suresh Kumar
    BLOOD, 2022, 140 : 7854 - 7855
  • [38] An open-source machine learning framework for global analyses of parton distributions
    Richard D. Ball
    Stefano Carrazza
    Juan Cruz-Martinez
    Luigi Del Debbio
    Stefano Forte
    Tommaso Giani
    Shayan Iranipour
    Zahari Kassabov
    Jose I. Latorre
    Emanuele R. Nocera
    Rosalyn L. Pearson
    Juan Rojo
    Roy Stegeman
    Christopher Schwan
    Maria Ubiali
    Cameron Voisey
    Michael Wilson
    The European Physical Journal C, 2021, 81
  • [39] Building Forecasting Solutions Using Open-Source and Azure Machine Learning
    Hu, Chenhui
    Paunic, Vanja
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020 : 3497 - 3498
  • [40] Optimizing Leak Detection in Open-source Platforms with Machine Learning Techniques
    Lounici, Sofiane
    Rosa, Marco
    Negri, Carlo Maria
    Trabelsi, Slim
    Onen, Melek
    ICISSP: PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS SECURITY AND PRIVACY, 2021 : 145 - 159