ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 1
Authors
Feuer, Benjamin [1 ]
Liu, Yurong [1 ]
Hegde, Chinmay [1 ]
Freire, Juliana [1 ]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / No. 9
Keywords
DOI
10.14778/3665844.3665857
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when the types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
Pages: 2279-2292
Page count: 14
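The four-stage pipeline named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be sketched in miniature as follows. This is an illustrative sketch only, not the paper's implementation: the sampling rule, prompt wording, stubbed model call, candidate type list, and fuzzy-string remapping are all assumptions.

```python
from difflib import get_close_matches

# Hypothetical candidate label set; ArcheType itself targets benchmark-specific
# type vocabularies such as SOTAB's.
TYPES = ["country", "currency", "person name", "date"]

def sample_context(column, k=5):
    """Context sampling (sketch): keep up to k distinct non-empty cell values."""
    seen = []
    for cell in column:
        if cell and cell not in seen:
            seen.append(cell)
        if len(seen) == k:
            break
    return seen

def serialize_prompt(cells, types):
    """Prompt serialization (sketch): render sampled cells and candidate types."""
    return (
        "Classify the semantic type of this table column.\n"
        f"Sample values: {', '.join(cells)}\n"
        f"Choose one type from: {', '.join(types)}\nType:"
    )

def query_model(prompt):
    """Model querying: stand-in for an LLM call; returns free-form text."""
    # A real system would send `prompt` to a large language model here.
    return "Country names"

def remap_label(raw_answer, types):
    """Label remapping (sketch): snap a free-form answer onto a known type
    via fuzzy string matching, so out-of-vocabulary answers still resolve."""
    answer = raw_answer.strip().lower()
    if answer in types:
        return answer
    return get_close_matches(answer, types, n=1, cutoff=0.0)[0]

column = ["France", "Brazil", "Japan", "France", "Kenya"]
prompt = serialize_prompt(sample_context(column), TYPES)
print(remap_label(query_model(prompt), TYPES))  # → country
```

The remapping step matters because zero-shot LLM output is free text ("Country names") rather than a label from the closed vocabulary; snapping it onto the nearest known type keeps the classifier's output space fixed.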