ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 1

Authors:

Feuer, Benjamin [1]

Liu, Yurong [1]

Hegde, Chinmay [1]

Freire, Juliana [1]

Affiliations:

[1] NYU, New York, NY 10016 USA

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / No. 9

Keywords:

DOI:

10.14778/3665844.3665857

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; they require a large number of training samples per type; they incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks, which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.


Pages: 2279-2292

Page count: 14
