Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models

Cited by: 2
Authors
Kim, Jungyeon [1]
Chung, Sehwan [1]
Chi, Seokho [1,2]
Affiliations
[1] Seoul Natl Univ, Dept Civil & Environm Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst Construct & Environm Engn, Seoul 08826, South Korea
Funding
National Research Foundation of Singapore
Keywords
DOI
10.1061/JCEMD4.COENG-14273
Chinese Library Classification number
TU [Building Science]
Discipline classification code
0813
Abstract
The growth of the global construction market has attracted international companies to participate in overseas projects. Overseas projects are extremely dynamic with numerous uncertainties, raising the need to collect information about construction in host countries. Due to the vast amounts of text data in the construction industry, an automated method, specifically information retrieval, is required to find the necessary information. Previous studies have suggested automated methods to review various construction documents. However, these studies required substantial manual effort and mainly focused on only one language, resulting in loss of vital information because it is buried in documents written in the host country's language. To address these limitations, this study proposes a cross-lingual information retrieval (CLIR) framework using pretrained Bidirectional Encoder Representations from Transformers (BERT) models to retrieve information from multilingual construction documents. The proposed framework employs language models (i.e., monolingual, multilingual, and cross-lingual) and trains these models on a construction data set to enhance their ability in construction-specific text. The framework achieved reliable performance of retrieval, even with minimal additional training using domain-specific data. The results indicate that training on the domain data set raises the level of retrieval, increasing the mean reciprocal rank of a specific task by up to 0.2128. With the employment of a monolingual model with machine translation, CLIR in a specific domain could be performed effectively without the need for a labeled data set. The suggested CLIR framework offers a practical alternative for dealing with construction documents in overseas projects, reducing time and cost while improving risk identification and mitigation.
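The abstract reports its gain in terms of mean reciprocal rank (MRR). As a reading aid, here is a minimal sketch of how MRR is commonly computed for a set of queries that each have one relevant document. The function names, document IDs, and sample rankings are hypothetical and are not taken from the paper, whose evaluation setup is not described in this record.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevant_ids: Sequence[str]
) -> float:
    """Average the reciprocal rank over all queries."""
    if len(rankings) != len(relevant_ids):
        raise ValueError("one relevant ID is required per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, relevant_ids))
    return total / len(rankings)


# Hypothetical example: three queries, each with a ranked list of document IDs.
rankings = [["d3", "d1", "d2"], ["d5", "d4"], ["d9", "d8"]]
relevant_ids = ["d1", "d5", "d7"]  # d7 is never retrieved for the third query
mrr = mean_reciprocal_rank(rankings, relevant_ids)  # (1/2 + 1 + 0) / 3 = 0.5
```

Under this definition, an MRR increase of 0.2128 means the relevant document moves noticeably closer to the top of the ranked list on average.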
Pages: 15