MaterialBERT for natural language processing of materials science texts

Cited by: 15
Authors
Yoshitake, Michiko [1]
Sato, Fumitaka [1, 2]
Kawano, Hiroyuki [1, 2]
Teraoka, Hiroshi [1, 2]
Affiliations
[1] Natl Inst Mat Sci, MaDIS, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Ridgelinez, Business Sci Unit, Tokyo, Japan
Source
SCIENCE AND TECHNOLOGY OF ADVANCED MATERIALS-METHODS | 2022, Vol. 2, No. 1
Keywords
Word embedding; pre-training; BERT; literal information
DOI
10.1080/27660400.2022.2124831
Chinese Library Classification number
T [Industrial technology]
Subject classification code
08
Abstract
A BERT (Bidirectional Encoder Representations from Transformers) model, which we named "MaterialBERT", was generated using scientific papers from a wide area of materials science as a corpus. A new vocabulary list for the tokenizer was generated from this materials science corpus. Two BERT models with different tokenizer vocabulary lists were generated: one using the original vocabulary made by Google and the other using the vocabulary newly made by the authors. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives, for not only inorganic materials but also organic materials and organometallic compounds. Fine-tuning on CoLA (the Corpus of Linguistic Acceptability) using the pre-trained MaterialBERT gave a higher score than the original BERT. The two MaterialBERT models could also be used as a starting point for transfer learning of a narrower domain-specific BERT.
Pages: 372-380
Page count: 9