MaterialBERT for natural language processing of materials science texts

Times cited: 15
Authors
Yoshitake, Michiko [1 ]
Sato, Fumitaka [1 ,2 ]
Kawano, Hiroyuki [1 ,2 ]
Teraoka, Hiroshi [1 ,2 ]
Affiliations
[1] Natl Inst Mat Sci, MaDIS, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Ridgelinez, Business Sci Unit, Tokyo, Japan
Source
SCIENCE AND TECHNOLOGY OF ADVANCED MATERIALS-METHODS, 2022, Vol. 2, No. 1
Keywords
Word embedding; pre-training; BERT; literal information
DOI
10.1080/27660400.2022.2124831
Chinese Library Classification
T [Industrial Technology]
Discipline code
08
Abstract
A BERT (Bidirectional Encoder Representations from Transformers) model, which we named "MaterialBERT", was generated using scientific papers covering a wide range of materials science as its corpus. A new vocabulary list for the tokenizer was built from this materials science corpus. Two BERT models with different tokenizer vocabularies were generated: one using the original vocabulary released by Google, and the other using the vocabulary newly built by the authors. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationships between base materials and their compounds or derivatives, not only for inorganic materials but also for organic materials and organometallic compounds. Fine-tuning on CoLA (the Corpus of Linguistic Acceptability) with the pre-trained MaterialBERT achieved a higher score than the original BERT. The two MaterialBERT models could also serve as a starting point for transfer learning of narrower domain-specific BERT models.
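The material-class clustering described in the abstract rests on comparing embedding vectors of material names. The toy sketch below illustrates the idea with made-up three-dimensional vectors and cosine similarity; it is not the authors' code, and real vectors would come from the hidden states of the pre-trained MaterialBERT encoder (the material names and all vector values here are invented for illustration).

```python
import numpy as np

# Made-up embedding vectors for illustration only; real vectors would be
# taken from the last hidden layer of a pre-trained BERT-style encoder.
embeddings = {
    "TiO2":         np.array([0.9, 0.1, 0.0]),  # oxide
    "SiO2":         np.array([0.8, 0.2, 0.1]),  # oxide
    "polyethylene": np.array([0.1, 0.9, 0.2]),  # polymer
}

def cosine(u, v):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_oxides = cosine(embeddings["TiO2"], embeddings["SiO2"])
sim_cross = cosine(embeddings["TiO2"], embeddings["polyethylene"])

# A same-class pair (two oxides) should score higher than a cross-class
# pair (oxide vs. polymer) if the embeddings reflect material classes.
print(f"same-class: {sim_oxides:.3f}, cross-class: {sim_cross:.3f}")
```

With well-trained embeddings, clustering algorithms (e.g. k-means or hierarchical clustering on these similarities) would group material names by class, which is the behaviour the paper reports for both MaterialBERT variants.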
Pages: 372-380
Page count: 9
Related papers
(50 records in total)
  • [31] The language of classifying in introductory science texts
    Darian, S
    JOURNAL OF PRAGMATICS, 1997, 27 (06) : 815 - 839
  • [32] An architecture for language processing for scientific texts
    Copestake, Ann
    Corbett, Peter
    Murray-Rust, Peter
    Rupp, C. J.
    Siddharthan, Advaith
    Teufel, Simone
    Waldron, Ben
    PROCEEDINGS OF THE UK E-SCIENCE ALL HANDS MEETING 2006, 2006, : 614 - 621
  • [33] Extracting Intrauterine Device Usage from Clinical Texts using Natural Language Processing
    Shi, Jianlin
    Mowery, Danielle
    Chapman, Wendy
    Zhang, Mingyuan
    Sanders, Jessica
    Gawron, Lori
    2017 IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI), 2017, : 568 - 571
  • [34] Towards a Methodology for Comparing Legal Texts Based on Semantic, Storytelling and Natural Language Processing
    Graziano, Mariangela
    Di Martino, Beniamino
    Cante, Luigi Colucci
    Esposito, Antonio
    Lupi, Pietro
    COMPLEX, INTELLIGENT AND SOFTWARE INTENSIVE SYSTEMS, CISIS-2024, 2024, 87 : 343 - 352
  • [36] Introduction for artificial intelligence and law: special issue "natural language processing for legal texts"
    Robaldo, Livio
    Villata, Serena
    Wyner, Adam
    Grabmair, Matthias
    ARTIFICIAL INTELLIGENCE AND LAW, 2019, 27 (02) : 113 - 115
  • [37] Determining Of Semantically Close Texts Of Stock Market News Using Natural Language Processing
    Bosacheva, Tatiana
    Magomedov, Shamil
    Lebedev, Artem
    INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND ENERGY TECHNOLOGIES (ICECET 2021), 2021, : 871 - 875
  • [38] Natural language processing in mental health applications using non-clinical texts
    Calvo, Rafael A.
    Milne, David N.
    Hussain, M. Sazzad
    Christensen, Helen
    NATURAL LANGUAGE ENGINEERING, 2017, 23 (05) : 649 - 685
  • [39] Automatically Extracting Procedural Knowledge from Instructional Texts using Natural Language Processing
    Zhang, Ziqi
    Webster, Philip
    Uren, Victoria
    Varga, Andrea
    Ciravegna, Fabio
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 520 - 527
  • [40] Polysemy in Controlled Natural Language Texts
    Gruzitis, Normunds
    Barzdins, Guntis
    CONTROLLED NATURAL LANGUAGE, 2010, 5972 : 102 - 120