Domain-specific language models pre-trained on construction management systems corpora

Times Cited: 10
Authors
Zhong, Yunshun [1 ]
Goodfellow, Sebastian D. [1 ]
Affiliations
[1] Univ Toronto, Dept Civil & Mineral Engn, 35 St George St, Toronto, ON M5S 1A4, Canada
Keywords
Construction management; Domain-specific large language models; Pre-training; Natural language processing (NLP); Transfer learning; Text classification (TC); Named entity recognition (NER); Corpus development
DOI
10.1016/j.autcon.2024.105316
Chinese Library Classification (CLC) Number
TU [Building Science];
Discipline Classification Code
0813;
Abstract
The rising demand for automated methods in the Construction Management Systems (CMS) sector highlights opportunities for the Transformer architecture, which enables pre-training Deep Learning models on large, unlabeled datasets for Natural Language Processing (NLP) tasks and outperforms traditional Recurrent Neural Network models. However, the potential of such models in the CMS domain remains underexplored. This research therefore produced the first CMS-domain corpora from academic papers and introduced an end-to-end pipeline for pre-training and fine-tuning domain-specific Pre-trained Language Models. Four corpora were constructed, and transfer learning was employed to pre-train BERT and RoBERTa on them. The best-performing models were then fine-tuned and outperformed models pre-trained on general corpora. In two key NLP tasks, text classification on an infrastructure condition prediction dataset and named entity recognition on an automatic construction control dataset, domain-specific pre-training improved F1 scores by 5.9% and 8.5%, respectively. These promising results demonstrate applicability extending beyond CMS to the broader Architecture, Engineering, and Construction sector.
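The pipeline described in the abstract (continued, domain-adaptive pre-training of BERT or RoBERTa on an unlabeled domain corpus, followed by task-specific fine-tuning for text classification and NER) can be sketched with the Hugging Face Transformers and Datasets libraries. The snippet below is a minimal illustration rather than the authors' implementation: the corpus file cms_corpus.txt, the base checkpoint, and all hyperparameters are assumed for demonstration only.

```python
# Minimal sketch of domain-adaptive pre-training with a masked-language-modeling
# objective. File names and hyperparameters are illustrative assumptions, not
# the settings reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"  # RoBERTa can be swapped in the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Hypothetical domain corpus: one sentence or paragraph per line of plain text.
corpus = load_dataset("text", data_files={"train": "cms_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style masked-language-modeling: 15% of tokens are randomly masked.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cms-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()                 # continued pre-training on the domain corpus
trainer.save_model("cms-bert")  # checkpoint later fine-tuned for TC / NER
```

The resulting checkpoint would then be loaded with a task head (for example, AutoModelForSequenceClassification for text classification or AutoModelForTokenClassification for NER) and fine-tuned on the labeled downstream datasets.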
Pages: 14