Auto-generating question-answering datasets with domain-specific knowledge for language models in scientific tasks

Times cited: 0
Authors
Li, Zongqian [1]
Cole, Jacqueline M. [1,2]
Affiliations
[1] Univ Cambridge, Dept Phys, Cavendish Lab, J J Thomson Ave, Cambridge CB3 0HE, England
[2] Harwell Sci & Innovat Campus, ISIS Neutron & Muon Source, Rutherford Appleton Lab, Didcot OX11 0QX, Oxon, England
Source
Funding
UK Science and Technology Facilities Council;
Keywords
DOI
10.1039/d4dd00307a
Chinese Library Classification
O6 [Chemistry];
Subject classification code
0703;
Abstract
Large language models (LLMs) have emerged as a useful tool for the public to process and respond to a vast range of interactive text-based queries. While foundational LLMs are well suited to answering general user queries, smaller language models that have been trained on custom text from a specific domain of interest tend to display superior performance on queries about that domain, and they can operate faster and more efficiently. Nonetheless, considerable resources are still needed to pre-train a language model with custom data. We present a pipeline that overcomes this need for pre-training. The pipeline first uses new algorithms that we have designed to produce a large, high-quality question-answering dataset (SCQA) for a particular domain of interest, solar cells. These algorithms employ a solar-cell database that had been auto-generated using the 'chemistry-aware' natural language processing tool ChemDataExtractor. In turn, this SCQA dataset is used to fine-tune language models, whose resulting F1-scores far exceed (by 10-20%) those of analogous language models that have been fine-tuned against a general-English QA dataset, SQuAD. Importantly, the performance of the language models fine-tuned against the SCQA dataset does not depend on the size of their architecture, on whether their tokens were cased or uncased, or on whether the foundational language models were further pre-trained with domain-specific data or fine-tuned directly from their vanilla state. This shows that the domain-specific SCQA dataset produced by our algorithms carries sufficient intrinsic domain knowledge that a foundational language model can be fine-tuned against it directly for immediate use with improved performance.
Pages: 8
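
Illustrative sketch. As an illustration of the dataset-generation step described in the abstract, the following is a minimal sketch (not the authors' published algorithm) of how a single property record, such as one auto-extracted by ChemDataExtractor, could be turned into an extractive question-answer pair in SQuAD-style JSON. The record fields (compound, property, value, unit), the question template, and the context sentence are illustrative assumptions.

# Minimal sketch, assuming a simple record-to-QA templating step; the SCQA
# generation algorithms in the paper are not reproduced here.
import json

def record_to_squad_entry(record: dict, entry_id: str) -> dict:
    """Build one SQuAD-format paragraph entry from a single property record."""
    # Hypothetical record fields: compound name, property name, value, and unit.
    context = (
        f"The {record['property']} of {record['compound']} "
        f"was measured as {record['value']} {record['unit']}."
    )
    answer_text = f"{record['value']} {record['unit']}"
    question = f"What is the {record['property']} of {record['compound']}?"
    return {
        "context": context,
        "qas": [{
            "id": entry_id,
            "question": question,
            "answers": [{
                "text": answer_text,
                # Extractive QA needs the character offset of the answer in the context.
                "answer_start": context.index(answer_text),
            }],
            "is_impossible": False,
        }],
    }

if __name__ == "__main__":
    # Illustrative solar-cell record; real auto-extracted output is richer and noisier.
    record = {"compound": "CH3NH3PbI3", "property": "open-circuit voltage",
              "value": "1.05", "unit": "V"}
    dataset = {"version": "scqa-sketch", "data": [
        {"title": record["compound"],
         "paragraphs": [record_to_squad_entry(record, "scqa-0001")]}
    ]}
    print(json.dumps(dataset, indent=2))

Entries built this way follow the same JSON schema as SQuAD, so they can be fed directly to standard extractive-QA fine-tuning code, which is the role the SCQA dataset plays in the pipeline described in the abstract.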