A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 1
Authors
Unnam, Narendra Babu [1 ]
Reddy, P. Krishna [1 ]
Affiliations
[1] IIIT Hyderabad, Kohli Ctr Intelligent Syst, Hyderabad, India
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings;
DOI
10.1007/s41060-019-00200-5
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a single position vector in the same space as the word embeddings. As a result, they are unable to capture the multiple aspects and the broad context of the document. Due to their low representational power, these approaches also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power, and the document is modeled as a vector of distances between these feature words and the document. We propose two criteria for selecting potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and offers scope for new approaches that improve document representation and its applications.
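The abstract above outlines the core idea: select feature words with high discriminating power and encode a document as its distances to those words in a pre-trained embedding space. The sketch below illustrates one plausible reading of that pipeline; the specific distance (minimum cosine distance from a feature word to any word in the document), the toy embeddings, and the function names are assumptions for illustration only, not the paper's exact selection criteria or distance function.

import numpy as np

def cosine_distance(u, v):
    # 1 minus cosine similarity between two embedding vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def feature_word_to_document_distance(feature_vec, doc_vecs):
    # Distance from a feature word to the document, taken here as the minimum
    # distance to any word occurring in the document (an assumed stand-in for
    # the paper's distance function).
    return min(cosine_distance(feature_vec, w) for w in doc_vecs)

def represent_document(doc_tokens, feature_words, embeddings):
    # Map a tokenized document to a vector whose i-th entry is the distance
    # between the i-th feature word and the document, so each dimension stays
    # interpretable as closeness to a concrete word.
    doc_vecs = [embeddings[t] for t in doc_tokens if t in embeddings]
    return np.array([
        feature_word_to_document_distance(embeddings[f], doc_vecs)
        for f in feature_words
    ])

# Toy example with made-up 3-dimensional embeddings.
embeddings = {
    "goal":   np.array([0.9, 0.1, 0.0]),
    "match":  np.array([0.8, 0.2, 0.1]),
    "stock":  np.array([0.1, 0.9, 0.2]),
    "market": np.array([0.2, 0.8, 0.1]),
}
feature_words = ["goal", "stock"]     # assumed high-discrimination feature words
document = ["match", "goal", "goal"]  # a short sports-related document
print(represent_document(document, feature_words, embeddings))

In such a representation, a small value in the "goal" dimension signals that the document sits close to sports vocabulary, which is what makes each feature directly interpretable as a word.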
Pages: 49-64
Page count: 16