Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings

Cited by: 0
Authors
Seegmiller, Parker [1 ]
Preum, Sarah Masud [1 ]
Affiliations
[1] Dartmouth College, Department of Computer Science, Hanover, NH 03755, USA
Funding
National Science Foundation (USA)
DOI: Not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
The popularity of transformer-based text embeddings calls for better statistical tools for measuring distributions of such embeddings. One such tool would be a method for ranking texts within a corpus by centrality, i.e., assigning each text a number signifying how representative that text is of the corpus as a whole. However, an intrinsic center-outward ordering of high-dimensional text representations is not trivial. A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution. We adopt a statistical depth to measure distributions of transformer-based text embeddings, which we call transformer-based text embedding (TTE) depth, and introduce its practical use for both modeling and distributional inference in NLP pipelines. We first define TTE depth and an associated rank sum test for determining whether two corpora differ significantly in embedding space. We then use TTE depth for the task of in-context learning prompt selection, showing that this approach reliably improves performance over statistical baselines across six text classification tasks. Finally, we use TTE depth and the associated rank sum test to characterize the distributions of synthesized and human-generated corpora, showing that five recent synthetic data augmentation processes cause a measurable distributional shift away from the associated human-generated text.
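
Since the record reproduces only the abstract, the following is a minimal, hypothetical sketch of how a TTE-depth-style pipeline could look in practice. The function names (tte_depth, depth_rank_sum_test) are illustrative, the angular (cosine) distance formulation is an assumption chosen because it is a common choice for directional embedding data, and SciPy's Wilcoxon rank-sum test stands in for the paper's associated rank sum test; none of this should be read as the authors' released implementation.

```python
# Hypothetical sketch of a TTE-depth-style pipeline; names and the
# angular-distance formulation are illustrative assumptions, not the
# authors' released code.
import numpy as np
from scipy.stats import ranksums


def tte_depth(candidates: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Score each candidate embedding by centrality w.r.t. a corpus.

    Assumed formulation: negated mean angular (cosine) distance to the
    corpus, so the most central texts receive the highest scores.
    candidates: (m, k); corpus: (n, k).
    """
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    corp = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    cos_sim = np.clip(cand @ corp.T, -1.0, 1.0)  # (m, n) cosine similarities
    return -np.arccos(cos_sim).mean(axis=1)      # higher = more central


def depth_rank_sum_test(corpus_a: np.ndarray, corpus_b: np.ndarray):
    """Wilcoxon rank-sum test comparing the depths of corpus B's texts,
    measured w.r.t. corpus A, against the depths of A's own texts.
    A small p-value suggests the corpora occupy measurably different
    regions of embedding space."""
    depths_a = tte_depth(corpus_a, corpus_a)
    depths_b = tte_depth(corpus_b, corpus_a)
    return ranksums(depths_a, depths_b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.normal(size=(200, 384))   # stand-in text embeddings
    synthetic = human + 0.5               # shifted "synthetic" corpus

    # Modeling use: pick the five most central (most representative)
    # texts as in-context learning prompts.
    prompt_ids = np.argsort(tte_depth(human, human))[::-1][:5]
    print("most central texts:", prompt_ids)

    # Inference use: test for a shift away from the human corpus.
    stat, pvalue = depth_rank_sum_test(human, synthetic)
    print(f"rank-sum statistic={stat:.2f}, p={pvalue:.3g}")
```

The design point the abstract turns on is that a depth yields a full center-outward ranking: prompt selection is then just a top-k over depth scores, and the same scores feed a nonparametric two-sample test, so one tool serves both the modeling and the distributional-inference use cases.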
Pages: 9600-9611 (12 pages)