ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 1
Authors
Feuer, Benjamin [1]
Liu, Yurong [1]
Hegde, Chinmay [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
Proceedings of the VLDB Endowment | 2024 / Vol. 17 / No. 9
DOI
10.14778/3665844.3665857
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
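The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be pictured with a minimal sketch. This is not the authors' implementation: every function name, the prompt wording, the sampling strategy, and the string-similarity fallback below are assumptions made for illustration, and `toy_model` is a hypothetical stand-in for a real large language model call.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def sample_context(column: List[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling (assumed strategy): keep up to k distinct non-empty values."""
    unique = sorted({v.strip() for v in column if v and v.strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples: List[str], labels: List[str]) -> str:
    """Prompt serialization (assumed wording): values plus the allowed label set."""
    return (
        f"Column values: {', '.join(samples)}\n"
        f"Options: {', '.join(labels)}\n"
        "Answer with exactly one option."
    )


def remap_label(raw_answer: str, labels: List[str]) -> str:
    """Label remapping (assumed rule): map free-form output onto an allowed label."""
    cleaned = raw_answer.strip().lower()
    # First try: an allowed label appears verbatim inside the answer.
    for label in labels:
        if label.lower() in cleaned:
            return label
    # Fallback: pick the label with the highest string similarity.
    return max(labels, key=lambda l: SequenceMatcher(None, cleaned, l.lower()).ratio())


def annotate_column(
    column: List[str],
    labels: List[str],
    query_model: Callable[[str], str],
    k: int = 5,
) -> str:
    """Run the four stages in order; query_model is any prompt -> text callable."""
    samples = sample_context(column, k)
    prompt = serialize_prompt(samples, labels)
    raw_answer = query_model(prompt)  # model querying
    return remap_label(raw_answer, labels)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; always gives the same verbose answer."""
    return "I think this is a Country column."


if __name__ == "__main__":
    labels = ["city", "country", "person name"]
    print(annotate_column(["France", "Japan", "Brazil"], labels, toy_model))
```

The remapping step is what keeps the pipeline zero-shot over an arbitrary label set: the model may answer in free text, and the answer is coerced back onto one of the labels supplied at query time.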
Pages: 2279-2292
Number of pages: 14