ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 1
Authors
Feuer, Benjamin [1]
Liu, Yurong [1]
Hegde, Chinmay [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / No. 09
DOI
10.14778/3665844.3665857
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
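The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the sample size, the prompt wording, and the string-similarity remapping rule are assumptions chosen for illustration, and the model is replaced by a hypothetical stub so the example runs offline.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def sample_context(values: List[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling (assumed strategy): up to k distinct non-empty cells."""
    unique = sorted({v.strip() for v in values if v and v.strip()})
    rng = random.Random(seed)
    return unique if len(unique) <= k else rng.sample(unique, k)


def serialize_prompt(sample: List[str], labels: List[str]) -> str:
    """Prompt serialization (assumed wording): one string with values and labels."""
    return (
        "Column values: " + ", ".join(sample) + "\n"
        "Candidate types: " + ", ".join(labels) + "\n"
        "Answer with exactly one candidate type."
    )


def remap_label(answer: str, labels: List[str]) -> str:
    """Label remapping (assumed rule): snap free-form output to an allowed label."""
    cleaned = answer.strip().lower()
    # First try to find an allowed label mentioned verbatim in the answer.
    for label in labels:
        if label.lower() in cleaned:
            return label
    # Otherwise fall back to the most similar label by character overlap.
    return max(
        labels,
        key=lambda l: SequenceMatcher(None, cleaned, l.lower()).ratio(),
    )


def annotate_column(
    values: List[str], labels: List[str], query_model: Callable[[str], str]
) -> str:
    """Run the four stages in order; query_model is any prompt -> text function."""
    prompt = serialize_prompt(sample_context(values), labels)
    return remap_label(query_model(prompt), labels)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a large language model call."""
    return "I think this is a Country column."


print(annotate_column(["France", "Peru", "Japan"],
                      ["city", "country", "person name"],
                      fake_model))
```

Because `query_model` is passed in as a plain function, the stub can be swapped for a real model client without touching the other stages, which mirrors the abstract's point that each component can be varied and ablated separately.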
Pages: 2279-2292
Page count: 14