A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 1
Authors
Unnam, Narendra Babu [1 ]
Reddy, P. Krishna [1 ]
Affiliations
[1] IIIT Hyderabad, Kohli Ctr Intelligent Syst, Hyderabad, India
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings;
DOI
10.1007/s41060-019-00200-5
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a single position vector in the same space as the word embeddings. As a result, they are unable to capture the multiple aspects and the broad context of the document. Due to their low representational power, these approaches also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power, and the document is modeled as a vector of distances between these feature words and the document. We propose two criteria for selecting potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and offers scope for new approaches that improve document representation and its applications.
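The abstract above outlines the core idea: select feature words with high discriminating power and encode a document as its distances to those words in a pre-trained embedding space. The sketch below illustrates one plausible reading of that pipeline; the specific distance (minimum cosine distance from a feature word to any word in the document), the toy embeddings, and the function names are assumptions for illustration only, not the paper's exact selection criteria or distance function.

import numpy as np

def cosine_distance(u, v):
    # 1 minus cosine similarity between two embedding vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def feature_word_to_document_distance(feature_vec, doc_vecs):
    # Distance from a feature word to the document, taken here as the minimum
    # distance to any word occurring in the document (an assumed stand-in for
    # the paper's distance function).
    return min(cosine_distance(feature_vec, w) for w in doc_vecs)

def represent_document(doc_tokens, feature_words, embeddings):
    # Map a tokenized document to a vector whose i-th entry is the distance
    # between the i-th feature word and the document, so each dimension stays
    # interpretable as closeness to a concrete word.
    doc_vecs = [embeddings[t] for t in doc_tokens if t in embeddings]
    return np.array([
        feature_word_to_document_distance(embeddings[f], doc_vecs)
        for f in feature_words
    ])

# Toy example with made-up 3-dimensional embeddings.
embeddings = {
    "goal":   np.array([0.9, 0.1, 0.0]),
    "match":  np.array([0.8, 0.2, 0.1]),
    "stock":  np.array([0.1, 0.9, 0.2]),
    "market": np.array([0.2, 0.8, 0.1]),
}
feature_words = ["goal", "stock"]     # assumed high-discrimination feature words
document = ["match", "goal", "goal"]  # a short sports-related document
print(represent_document(document, feature_words, embeddings))

In such a representation, a small value in the "goal" dimension signals that the document sits close to sports vocabulary, which is what makes each feature directly interpretable as a word.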
Pages: 49-64
Page count: 16