Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models

Cited by: 0
Authors
Wang, Ruida [1]
Zhou, Wangchunshu [2]
Sachan, Mrinmaya [3]
Affiliations
[1] HKUST, Hong Kong, People's Republic of China
[2] AIWaves Inc, Cardiff, Wales
[3] Swiss Federal Institute of Technology, Zurich, Switzerland
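Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023
Funding
Swiss National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract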
Data synthesis is a promising way to train a small model with very little labeled data. One approach for data synthesis is to leverage the rich knowledge from large language models to synthesize pseudo training examples for small models, making it possible to achieve both data and compute efficiency at the same time. However, a key challenge in data synthesis is that the synthesized dataset often suffers from a large distributional discrepancy from the real task data distribution. Thus, in this paper, we propose Synthesis Step by Step (S3), a data synthesis framework that shrinks this distribution gap by iteratively extrapolating the errors made by a small model trained on the synthesized dataset on a small real-world validation dataset using a large language model. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data, resulting in significant improvement compared to several baselines: 9.48% improvement compared to ZeroGen, 2.73% compared to GoldGen, and 15.17% improvement compared to the small model trained on human-annotated data.
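The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that errors made by the small model on a small real validation set are extrapolated by a large language model. The loop structure, the stopping rule, the toy word-vote classifier, and every name below (`train_small_model`, `misclassified`, `s3_style_loop`, `toy_extrapolate`) are assumptions for illustration, and the LLM call is replaced by a trivial stub.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)


def train_small_model(data: List[Example]) -> Callable[[str], int]:
    """Toy stand-in for the small model: each word votes for the labels it was seen with."""
    votes = {}
    for text, label in data:
        for word in text.lower().split():
            votes.setdefault(word, Counter())[label] += 1

    def predict(text: str) -> int:
        total = Counter()
        for word in text.lower().split():
            total.update(votes.get(word, {}))
        # Fall back to label 0 when no word is known.
        return total.most_common(1)[0][0] if total else 0

    return predict


def misclassified(predict: Callable[[str], int], validation: List[Example]) -> List[Example]:
    """Validation examples the current small model gets wrong."""
    return [(text, label) for text, label in validation if predict(text) != label]


def s3_style_loop(seed, validation, extrapolate, rounds=3):
    """Train, collect validation errors, synthesize more data from them, repeat."""
    dataset = list(seed)
    for _ in range(rounds):
        model = train_small_model(dataset)
        errors = misclassified(model, validation)
        if not errors:  # assumed stopping rule
            break
        dataset.extend(extrapolate(errors))
    return dataset, train_small_model(dataset)


def toy_extrapolate(errors: List[Example]) -> List[Example]:
    """Hypothetical stub for the LLM step: emits a word-reordered variant of each error."""
    return [(" ".join(reversed(text.split())), label) for text, label in errors]


seed = [("great movie", 1), ("terrible movie", 0)]        # stands in for the seed synthetic set
validation = [("wonderful film", 1), ("awful film", 0)]   # stands in for the small real set
dataset, model = s3_style_loop(seed, validation, toy_extrapolate)
```

In a real setting, `toy_extrapolate` would be replaced by a prompt asking a large language model for new examples similar to the misclassified ones, so that the synthetic data does not simply copy the validation text.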
Pages: 11817-11831
Page count: 15