Domain-specific language models pre-trained on construction management systems corpora

Times Cited: 10
Authors
Zhong, Yunshun [1 ]
Goodfellow, Sebastian D. [1 ]
Affiliations
[1] Univ Toronto, Dept Civil & Mineral Engn, 35 St George St, Toronto, ON M5S 1A4, Canada
Keywords
Construction management; Domain-specific large language models; Pre-training; Natural language processing (NLP); Transfer learning; Text classification (TC); Named entity recognition (NER); Corpus development
DOI
10.1016/j.autcon.2024.105316
Chinese Library Classification (CLC) Number
TU [Building Science];
Discipline Classification Code
0813;
Abstract
The rising demand for automated methods in the Construction Management Systems (CMS) sector highlights opportunities for the Transformer architecture, which enables pre-training Deep Learning models on large, unlabeled datasets for Natural Language Processing (NLP) tasks and outperforms traditional Recurrent Neural Network models. However, the potential of such models in the CMS domain remains underexplored. This research therefore produced the first CMS-domain corpora from academic papers and introduced an end-to-end pipeline for pre-training and fine-tuning domain-specific Pre-trained Language Models. Four corpora were constructed, and transfer learning was employed to pre-train BERT and RoBERTa on them. The best-performing models were then fine-tuned and outperformed models pre-trained on general corpora. In two key NLP tasks, text classification on an infrastructure condition prediction dataset and named entity recognition on an automatic construction control dataset, domain-specific pre-training improved F1 scores by 5.9% and 8.5%, respectively. These promising results demonstrate applicability extending beyond CMS to the broader Architecture, Engineering, and Construction sector.
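The pipeline described in the abstract (continued, domain-adaptive pre-training of BERT or RoBERTa on an unlabeled domain corpus, followed by task-specific fine-tuning for text classification and NER) can be sketched with the Hugging Face Transformers and Datasets libraries. The snippet below is a minimal illustration rather than the authors' implementation: the corpus file cms_corpus.txt, the base checkpoint, and all hyperparameters are assumed for demonstration only.

```python
# Minimal sketch of domain-adaptive pre-training with a masked-language-modeling
# objective. File names and hyperparameters are illustrative assumptions, not
# the settings reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"  # RoBERTa can be swapped in the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Hypothetical domain corpus: one sentence or paragraph per line of plain text.
corpus = load_dataset("text", data_files={"train": "cms_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style masked-language-modeling: 15% of tokens are randomly masked.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cms-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()                 # continued pre-training on the domain corpus
trainer.save_model("cms-bert")  # checkpoint later fine-tuned for TC / NER
```

The resulting checkpoint would then be loaded with a task head (for example, AutoModelForSequenceClassification for text classification or AutoModelForTokenClassification for NER) and fine-tuned on the labeled downstream datasets.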
Pages: 14