A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 1
Authors
Unnam, Narendra Babu [1 ]
Reddy, P. Krishna [1 ]
Affiliations
[1] IIIT Hyderabad, Kohli Ctr Intelligent Syst, Hyderabad, India
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings;
DOI
10.1007/s41060-019-00200-5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a single position vector in the same space as the word embeddings. As a result, they fail to capture the multiple aspects and the broader context of the document. Due to this low representational power, they also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power, and the document is modeled as its distances to these feature words. We propose two criteria for selecting the potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and opens scope for new approaches that improve document representation and its applications.
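A minimal sketch of the idea described in the abstract (not the authors' implementation): each dimension of the document vector is the distance between one selected feature word and the document in a pre-trained embedding space. The names (`embed`, `feature_words`, `doc_tokens`) and the specific choice of cosine distance aggregated by the minimum over document words are illustrative assumptions, since the paper's actual selection criteria and distance function are not given here.

```python
# Sketch: interpretable document vector as distances to feature words
# in a pre-trained word-embedding space (assumed design choices).
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def document_vector(doc_tokens, feature_words, embed):
    """embed: dict mapping word -> pre-trained embedding (np.ndarray).
    Returns one interpretable dimension per feature word."""
    doc_vecs = [embed[w] for w in doc_tokens if w in embed]
    if not doc_vecs:
        return np.ones(len(feature_words))  # no known words: maximal distance
    features = []
    for fw in feature_words:
        # Distance between a feature word and the document: here taken as the
        # smallest cosine distance to any word in the document (one possible choice).
        d = min(cosine_distance(embed[fw], dv) for dv in doc_vecs)
        features.append(d)
    return np.array(features)
```

The resulting vector can be fed to any standard classifier; because each dimension is tied to a named feature word, the learned weights remain directly interpretable.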
Pages: 49-64
Page count: 16
Related papers
50 records in total
  • [21] Arabic Fake News Detection in Social Media Context Using Word Embeddings and Pre-trained Transformers
    Azzeh, Mohammad
    Qusef, Abdallah
    Alabboushi, Omar
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2025, 50 (02) : 923 - 936
  • [22] Improving the accuracy using pre-trained word embeddings on deep neural networks for Turkish text classification
    Aydogan, Murat
    Karci, Ali
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2020, 541
  • [23] An Investigation of Pre-trained Embeddings in Dependency Parsing
    Carvalho de Araujo, Juliana C.
    Freitas, Claudia
    Pacheco, Marco Aurelio C.
    Forero-Mendoza, Leonardo A.
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2020, 2020, 12037 : 281 - 290
  • [24] Leveraging Pre-Trained Embeddings for Welsh Taggers
    Ezeani, Ignatius M.
    Piao, Scott
    Neale, Steven
    Rayson, Paul
    Knight, Dawn
    4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019, : 270 - 280
  • [25] LSTM Easy-first Dependency Parsing with Pre-trained Word Embeddings and Character-level Word Embeddings in Vietnamese
    Binh Duc Nguyen
    Kiet Van Nguyen
    Ngan Luu-Thuy Nguyen
    PROCEEDINGS OF 2018 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2018, : 187 - 192
  • [26] Spatial Role Labeling based on Improved Pre-trained Word Embeddings and Transfer Learning
    Moussa, Alaeddine
    Fournier, Sebastien
    Mahmoudi, Khaoula
    Espinasse, Bernard
    Faiz, Sami
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 1218 - 1226
  • [27] From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough
    Mars, Mourad
    APPLIED SCIENCES-BASEL, 2022, 12 (17):
  • [28] Pre-trained Affective Word Representations
    Chawla, Kushal
    Khosla, Sopan
    Chhaya, Niyati
    Jaidka, Kokil
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [29] An Enhanced Sentiment Analysis Framework Based on Pre-Trained Word Embedding
    Mohamed, Ensaf Hussein
    Moussa, Mohammed ElSaid
    Haggag, Mohamed Hassan
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2020, 19 (04)
  • [30] Classification of Respiration Sounds Using Deep Pre-trained Audio Embeddings
    Meza, Carlos A. Galindo
    del Hoyo Ontiveros, Juan A.
    Lopez-Meyer, Paulo
    2021 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2021,