Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings

Cited by: 0
Authors
Seegmiller, Parker [1 ]
Preum, Sarah Masud [1 ]
Affiliations
[1] Dartmouth Coll, Dept Comp Sci, Hanover, NH 03755 USA
Funding
U.S. National Science Foundation;
Keywords
DOI
None available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The popularity of transformer-based text embeddings calls for better statistical tools for measuring distributions of such embeddings. One such tool would be a method for ranking texts within a corpus by centrality, i.e., assigning each text a number signifying how representative that text is of the corpus as a whole. However, an intrinsic center-outward ordering of high-dimensional text representations is not trivial. A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution. We adopt a statistical depth for measuring distributions of transformer-based text embeddings, which we term transformer-based text embedding (TTE) depth, and introduce its practical use for both modeling and distributional inference in NLP pipelines. We first define TTE depth and an associated rank-sum test for determining whether two corpora differ significantly in embedding space. We then use TTE depth for the task of in-context learning prompt selection, showing that this approach reliably improves performance over statistical baseline approaches across six text classification tasks. Finally, we use TTE depth and the associated rank-sum test to characterize the distributions of synthesized and human-generated corpora, showing that five recent synthetic data augmentation processes cause a measurable distributional shift away from the associated human-generated text.
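The workflow in the abstract — score each text's centrality in embedding space, rank by that score, and run a rank-sum test between two corpora's depth values — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it uses plain mean cosine similarity as the depth function (the paper defines TTE depth via angular distance), random Gaussian vectors as stand-ins for real transformer embeddings, and function names of my own choosing.

```python
import numpy as np
from scipy.stats import ranksums

def depth(x, corpus):
    """Centrality of embedding x w.r.t. a corpus of embeddings:
    mean cosine similarity to every corpus embedding.
    (A simplified stand-in for the paper's TTE depth.)"""
    x = x / np.linalg.norm(x)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return float(np.mean(c @ x))

def rank_by_centrality(corpus):
    """Order texts from most to least representative by depth."""
    depths = np.array([depth(e, corpus) for e in corpus])
    return np.argsort(-depths), depths

def depth_rank_sum_test(corpus_p, corpus_q):
    """Wilcoxon rank-sum test on the depths of both corpora,
    each measured w.r.t. corpus_p as the reference distribution."""
    d_p = [depth(e, corpus_p) for e in corpus_p]
    d_q = [depth(e, corpus_p) for e in corpus_q]
    return ranksums(d_p, d_q)

# Toy data: a "human" corpus and a distribution-shifted "synthetic" one.
rng = np.random.default_rng(0)
human = rng.normal(1.0, 1.0, size=(200, 16))
synthetic = rng.normal(-1.0, 1.0, size=(200, 16))

order, depths = rank_by_centrality(human)   # order[0] = most central text
stat, p = depth_rank_sum_test(human, synthetic)
print(f"rank-sum statistic={stat:.2f}, p={p:.3g}")
```

With the shifted toy corpora the test rejects at any conventional level; in the paper's setting the same procedure compares depth values of synthesized versus human-written texts under a real embedding model.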
Pages: 9600-9611
Page count: 12
Related Papers
50 records total
  • [1] Development of a Text Classification Framework using Transformer-based Embeddings
    Yeasmin, Sumona
    Afrin, Nazia
    Saif, Kashfia
    Huq, Mohammad Rezwanul
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON DATA SCIENCE, TECHNOLOGY AND APPLICATIONS (DATA), 2022, : 74 - 82
  • [2] Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese
    de Lima Santos, Diego Bernardes
    de Carvalho Dutra, Frederico Giffoni
    Parreiras, Fernando Silva
    Brandao, Wladmir Cardoso
    PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS (ICEIS 2021), VOL 1, 2021, : 473 - 483
  • [3] Transformer-based Text Detection in the Wild
    Raisi, Zobeir
    Naiel, Mohamed A.
    Younes, Georges
    Wardell, Steven
    Zelek, John S.
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 3156 - 3165
  • [4] TIRec: Transformer-based Invoice Text Recognition
    Chen, Yanlan
    2023 2ND ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING, CACML 2023, 2023, : 175 - 180
  • [5] Practical Transformer-based Multilingual Text Classification
    Wang, Cindy
    Banko, Michele
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, NAACL-HLT 2021, 2021, : 121 - 129
  • [6] A Transformer-Based Framework for Scene Text Recognition
    Selvam, Prabu
    Koilraj, Joseph Abraham Sundar
    Tavera Romero, Carlos Andres
    Alharbi, Meshal
    Mehbodniya, Abolfazl
    Webber, Julian L.
    Sengan, Sudhakar
    IEEE ACCESS, 2022, 10 : 100895 - 100910
  • [7] LMMS reloaded: Transformer-based sense embeddings for disambiguation and beyond
    Loureiro, Daniel
    Camacho-Collados, Jose
    Jorge, Alipio Mario
    ARTIFICIAL INTELLIGENCE, 2022, 305
  • [8] Transformer-based Question Text Generation in the Learning System
    Li, Jiajun
    Song, Huazhu
    Li, Jun
    6TH INTERNATIONAL CONFERENCE ON INNOVATION IN ARTIFICIAL INTELLIGENCE, ICIAI2022, 2022, : 50 - 56
  • [9] Applying Transformer-Based Text Summarization for Keyphrase Generation
    Glazkova A.V.
    Morozov D.A.
    Lobachevskii Journal of Mathematics, 2023, 44 (1) : 123 - 136
  • [10] Video text tracking with transformer-based local search
    Zhou, Xingsheng
    Wang, Cheng
    Wang, Xinggang
    Liu, Wenyu
    NEUROCOMPUTING, 2024, 609