BioWordVec, improving biomedical word embeddings with subword information and MeSH

Cited by: 248
Authors
Zhang, Yijia [1, 2]
Chen, Qingyu [1]
Yang, Zhihao [2]
Lin, Hongfei [2]
Lu, Zhiyong [1]
Affiliations
[1] NIH, NCBI, NLM, Bethesda, MD 20894 USA
[2] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116023, Liaoning, Peoples R China
Keywords
DRUG INTERACTION EXTRACTION
DOI
10.1038/s41597-019-0055-0
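Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract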
Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring both the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
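The abstract does not say how "subword information" is extracted. As an illustration only, the sketch below assumes fastText-style character n-grams, where a word is wrapped in boundary markers and split into overlapping substrings. The function name `char_ngrams` and the default lengths of 3 to 6 are assumptions made for this example, not details taken from the record above.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in '<' and '>' markers.

    Hypothetical helper for illustration; the n-gram range is an assumption.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("gene", n_min=3, n_max=4))
# → ['<ge', 'gen', 'ene', 'ne>', '<gen', 'gene', 'ene>']
```

In a model of this kind, a word's vector is typically built from the vectors of its n-grams, which is why rare or unseen biomedical terms that share substrings with known terms can still receive a meaningful representation.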
Pages: 9
Related papers
50 in total
  • [1] BioWordVec, improving biomedical word embeddings with subword information and MeSH
    Yijia Zhang
    Qingyu Chen
    Zhihao Yang
    Hongfei Lin
    Zhiyong Lu
    Scientific Data, 6
  • [2] Improving Word Embeddings with Convolutional Feature Learning and Subword Information
    Cao, Shaosheng
    Lu, Wei
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3144 - 3151
  • [3] Improving Biomedical Information Extraction with Word Embeddings Trained on Closed-Domain Corpora
    Silvestri, Stefano
    Gargiulo, Francesco
    Ciampi, Mario
    2019 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (ISCC), 2019, : 1129 - 1134
  • [4] On the Impact of the Length of Subword Vectors on Word Embeddings
    Cai, Xiangrui
    Luo, Yonghong
    Zhang, Ying
    Yuan, Xiaojie
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, 2019, 11448 : 495 - 499
  • [5] Subword-based Compact Reconstruction of Word Embeddings
    Sasaki, Shota
    Suzuki, Jun
    Inui, Kentaro
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 3498 - 3508
  • [6] Biomedical Word Sense Disambiguation with Word Embeddings
    Antunes, Rui
    Matos, Sergio
    11TH INTERNATIONAL CONFERENCE ON PRACTICAL APPLICATIONS OF COMPUTATIONAL BIOLOGY & BIOINFORMATICS, 2017, 616 : 273 - 279
  • [7] Improving bilingual word embeddings mapping with monolingual context information
    Zhu, Shaolin
    Mi, Chenggang
    Li, Tianqi
    Zhang, Fuhua
    Zhang, Zhifeng
    Sun, Yu
    MACHINE TRANSLATION, 2021, 35 (04) : 503 - 518
  • [8] Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations
    Chaudhary, Aditi
    Zhou, Chunting
    Levin, Lori
    Neubig, Graham
    Mortensen, David R.
    Carbonell, Jaime G.
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3285 - 3295
  • [10] Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
    van Staden, Lisa
    Kamper, Herman
    2020 INTERNATIONAL SAUPEC/ROBMECH/PRASA CONFERENCE, 2020, : 533 - 538