Auto-generating question-answering datasets with domain-specific knowledge for language models in scientific tasks

Times cited: 0
Authors
Li, Zongqian [1]
Cole, Jacqueline M. [1,2]
Affiliations
[1] Univ Cambridge, Dept Phys, Cavendish Lab, J J Thomson Ave, Cambridge CB3 0HE, England
[2] Harwell Sci & Innovat Campus, ISIS Neutron & Muon Source, Rutherford Appleton Lab, Didcot OX11 0QX, Oxon, England
Source
Funding
UK Science and Technology Facilities Council;
Keywords
DOI
10.1039/d4dd00307a
Chinese Library Classification
O6 [Chemistry];
Subject classification code
0703;
Abstract
Large language models (LLMs) have emerged as a useful tool for the public to process and respond to a vast range of interactive text-based queries. While foundational LLMs are well suited to answering general user queries, smaller language models that have been trained on custom text from a specific domain of interest tend to display superior performance on queries about that domain, and they can operate faster and more efficiently. Nonetheless, considerable resources are still needed to pre-train a language model with custom data. We present a pipeline that overcomes this need for pre-training. The pipeline first uses new algorithms that we have designed to produce a large, high-quality question-answering dataset (SCQA) for a particular domain of interest, solar cells. These algorithms employ a solar-cell database that had been auto-generated using the 'chemistry-aware' natural language processing tool ChemDataExtractor. In turn, this SCQA dataset is used to fine-tune language models, whose resulting F1-scores far exceed (by 10-20%) those of analogous language models that have been fine-tuned against a general-English QA dataset, SQuAD. Importantly, the performance of the language models fine-tuned against the SCQA dataset does not depend on the size of their architecture, on whether their tokens were cased or uncased, or on whether the foundational language models were further pre-trained with domain-specific data or fine-tuned directly from their vanilla state. This shows that the domain-specific SCQA dataset produced by our algorithms carries sufficient intrinsic domain knowledge that a foundational language model can be fine-tuned against it directly for immediate use with improved performance.
Pages: 8
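
Illustrative sketch. As an illustration of the dataset-generation step described in the abstract, the following is a minimal sketch (not the authors' published algorithm) of how a single property record, such as one auto-extracted by ChemDataExtractor, could be turned into an extractive question-answer pair in SQuAD-style JSON. The record fields (compound, property, value, unit), the question template, and the context sentence are illustrative assumptions.

# Minimal sketch, assuming a simple record-to-QA templating step; the SCQA
# generation algorithms in the paper are not reproduced here.
import json

def record_to_squad_entry(record: dict, entry_id: str) -> dict:
    """Build one SQuAD-format paragraph entry from a single property record."""
    # Hypothetical record fields: compound name, property name, value, and unit.
    context = (
        f"The {record['property']} of {record['compound']} "
        f"was measured as {record['value']} {record['unit']}."
    )
    answer_text = f"{record['value']} {record['unit']}"
    question = f"What is the {record['property']} of {record['compound']}?"
    return {
        "context": context,
        "qas": [{
            "id": entry_id,
            "question": question,
            "answers": [{
                "text": answer_text,
                # Extractive QA needs the character offset of the answer in the context.
                "answer_start": context.index(answer_text),
            }],
            "is_impossible": False,
        }],
    }

if __name__ == "__main__":
    # Illustrative solar-cell record; real auto-extracted output is richer and noisier.
    record = {"compound": "CH3NH3PbI3", "property": "open-circuit voltage",
              "value": "1.05", "unit": "V"}
    dataset = {"version": "scqa-sketch", "data": [
        {"title": record["compound"],
         "paragraphs": [record_to_squad_entry(record, "scqa-0001")]}
    ]}
    print(json.dumps(dataset, indent=2))

Entries built this way follow the same JSON schema as SQuAD, so they can be fed directly to standard extractive-QA fine-tuning code, which is the role the SCQA dataset plays in the pipeline described in the abstract.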