Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings

Cited by: 0
Authors
Seegmiller, Parker [1 ]
Preum, Sarah Masud [1 ]
Affiliations
[1] Dartmouth College, Department of Computer Science, Hanover, NH 03755, USA
Funding
National Science Foundation (USA)
DOI: Not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
The popularity of transformer-based text embeddings calls for better statistical tools for measuring distributions of such embeddings. One such tool would be a method for ranking texts within a corpus by centrality, i.e., assigning each text a number signifying how representative that text is of the corpus as a whole. However, an intrinsic center-outward ordering of high-dimensional text representations is not trivial. A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution. We adopt a statistical depth to measure distributions of transformer-based text embeddings, which we call transformer-based text embedding (TTE) depth, and introduce its practical use for both modeling and distributional inference in NLP pipelines. We first define TTE depth and an associated rank sum test for determining whether two corpora differ significantly in embedding space. We then use TTE depth for the task of in-context learning prompt selection, showing that this approach reliably improves performance over statistical baselines across six text classification tasks. Finally, we use TTE depth and the associated rank sum test to characterize the distributions of synthesized and human-generated corpora, showing that five recent synthetic data augmentation processes cause a measurable distributional shift away from the associated human-generated text.
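
Since the record reproduces only the abstract, the following is a minimal, hypothetical sketch of how a TTE-depth-style pipeline could look in practice. The function names (tte_depth, depth_rank_sum_test) are illustrative, the angular (cosine) distance formulation is an assumption chosen because it is a common choice for directional embedding data, and SciPy's Wilcoxon rank-sum test stands in for the paper's associated rank sum test; none of this should be read as the authors' released implementation.

```python
# Hypothetical sketch of a TTE-depth-style pipeline; names and the
# angular-distance formulation are illustrative assumptions, not the
# authors' released code.
import numpy as np
from scipy.stats import ranksums


def tte_depth(candidates: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Score each candidate embedding by centrality w.r.t. a corpus.

    Assumed formulation: negated mean angular (cosine) distance to the
    corpus, so the most central texts receive the highest scores.
    candidates: (m, k); corpus: (n, k).
    """
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    corp = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    cos_sim = np.clip(cand @ corp.T, -1.0, 1.0)  # (m, n) cosine similarities
    return -np.arccos(cos_sim).mean(axis=1)      # higher = more central


def depth_rank_sum_test(corpus_a: np.ndarray, corpus_b: np.ndarray):
    """Wilcoxon rank-sum test comparing the depths of corpus B's texts,
    measured w.r.t. corpus A, against the depths of A's own texts.
    A small p-value suggests the corpora occupy measurably different
    regions of embedding space."""
    depths_a = tte_depth(corpus_a, corpus_a)
    depths_b = tte_depth(corpus_b, corpus_a)
    return ranksums(depths_a, depths_b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.normal(size=(200, 384))   # stand-in text embeddings
    synthetic = human + 0.5               # shifted "synthetic" corpus

    # Modeling use: pick the five most central (most representative)
    # texts as in-context learning prompts.
    prompt_ids = np.argsort(tte_depth(human, human))[::-1][:5]
    print("most central texts:", prompt_ids)

    # Inference use: test for a shift away from the human corpus.
    stat, pvalue = depth_rank_sum_test(human, synthetic)
    print(f"rank-sum statistic={stat:.2f}, p={pvalue:.3g}")
```

The design point the abstract turns on is that a depth yields a full center-outward ranking: prompt selection is then just a top-k over depth scores, and the same scores feed a nonparametric two-sample test, so one tool serves both the modeling and the distributional-inference use cases.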
Pages: 9600-9611 (12 pages)