Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features

Cited by: 0
Authors
Wen, Jingran [1]
Gouripeddi, Ramkiran [1,2]
Facelli, Julio C. [1,2]
Affiliations
[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT 84108 USA
[2] Univ Utah, Ctr Clin & Translat Sci, Salt Lake City, UT 84108 USA
Funding
National Institutes of Health (USA)
Keywords
Metadata discovery; Text characterization; Data harmonization
DOI
10.1007/978-981-10-6451-7_8
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify biomedical dataset characteristics for metadata discovery. Four common types of biomedical data sources were included in this study: genetic variant data, protein structure data, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identify the four data sources with average F1 scores ranging from 0.935 to 1.000. The study demonstrates that biomedical data of different types have different distributions of token-based document structural features, and that such structural features can be leveraged for metadata discovery.
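The abstract does not list the actual token-based features, so the following is only a minimal sketch of the general idea, under stated assumptions: the four token classes (`alpha`, `numeric`, `alnum`, `other`), whitespace tokenization, and the helper names `token_features` and `f1_score` are all hypothetical and not taken from the paper. It shows how per-file token-class proportions could be computed and how a per-class F1 score is defined.

```python
import re
from collections import Counter

# Hypothetical token classes; the paper's real feature set is not
# described in the abstract.
TOKEN_CLASSES = ("alpha", "numeric", "alnum", "other")


def token_features(text):
    """Return the share of whitespace-delimited tokens in each coarse class."""
    tokens = text.split()
    counts = Counter()
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z]+", tok):
            counts["alpha"] += 1
        elif re.fullmatch(r"[+-]?\d+(\.\d+)?", tok):
            counts["numeric"] += 1
        elif re.fullmatch(r"[A-Za-z0-9]+", tok):
            counts["alnum"] += 1
        else:
            counts["other"] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {name: counts[name] / total for name in TOKEN_CLASSES}


def f1_score(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A structure-file-like line is dominated by numeric tokens,
# while ordinary prose is dominated by alphabetic tokens.
structure_line = "ATOM 1 N MET A 1 27.340 24.430 2.614"
prose_line = "Metadata discovery is the process"
print(token_features(structure_line))
print(token_features(prose_line))
```

Feature dictionaries like these could then be turned into vectors and passed to any decision tree learner (for example scikit-learn's `DecisionTreeClassifier`); whether the authors used that library is not stated in this record.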
Pages: 60-67
Page count: 8