A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 1
Authors
Unnam, Narendra Babu [1 ]
Reddy, P. Krishna [1 ]
Affiliations
[1] IIIT Hyderabad, Kohli Ctr Intelligent Syst, Hyderabad, India
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings;
DOI
10.1007/s41060-019-00200-5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a single position vector in the same space as the word embeddings. As a result, they fail to capture the multiple aspects and the broader context of the document. Due to this low representational power, they also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power, and the document is modeled as its distances to these feature words. We propose two criteria for selecting the potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and opens scope for new approaches that improve document representation and its applications.
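A minimal sketch of the idea described in the abstract (not the authors' implementation): each dimension of the document vector is the distance between one selected feature word and the document in a pre-trained embedding space. The names (`embed`, `feature_words`, `doc_tokens`) and the specific choice of cosine distance aggregated by the minimum over document words are illustrative assumptions, since the paper's actual selection criteria and distance function are not given here.

```python
# Sketch: interpretable document vector as distances to feature words
# in a pre-trained word-embedding space (assumed design choices).
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def document_vector(doc_tokens, feature_words, embed):
    """embed: dict mapping word -> pre-trained embedding (np.ndarray).
    Returns one interpretable dimension per feature word."""
    doc_vecs = [embed[w] for w in doc_tokens if w in embed]
    if not doc_vecs:
        return np.ones(len(feature_words))  # no known words: maximal distance
    features = []
    for fw in feature_words:
        # Distance between a feature word and the document: here taken as the
        # smallest cosine distance to any word in the document (one possible choice).
        d = min(cosine_distance(embed[fw], dv) for dv in doc_vecs)
        features.append(d)
    return np.array(features)
```

The resulting vector can be fed to any standard classifier; because each dimension is tied to a named feature word, the learned weights remain directly interpretable.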
Pages: 49-64
Page count: 16
Related papers
50 records in total
  • [21] Arabic Fake News Detection in Social Media Context Using Word Embeddings and Pre-trained Transformers
    Azzeh, Mohammad
    Qusef, Abdallah
    Alabboushi, Omar
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2025, 50 (02) : 923 - 936
  • [22] Improving the accuracy using pre-trained word embeddings on deep neural networks for Turkish text classification
    Aydogan, Murat
    Karci, Ali
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2020, 541
  • [23] An Investigation of Pre-trained Embeddings in Dependency Parsing
    Carvalho de Araujo, Juliana C.
    Freitas, Claudia
    Pacheco, Marco Aurelio C.
    Forero-Mendoza, Leonardo A.
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2020, 2020, 12037 : 281 - 290
  • [24] Leveraging Pre-Trained Embeddings for Welsh Taggers
    Ezeani, Ignatius M.
    Piao, Scott
    Neale, Steven
    Rayson, Paul
    Knight, Dawn
    4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019, : 270 - 280
  • [25] LSTM Easy-first Dependency Parsing with Pre-trained Word Embeddings and Character-level Word Embeddings in Vietnamese
    Binh Duc Nguyen
    Kiet Van Nguyen
    Ngan Luu-Thuy Nguyen
    PROCEEDINGS OF 2018 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2018, : 187 - 192
  • [26] Spatial Role Labeling based on Improved Pre-trained Word Embeddings and Transfer Learning
    Moussa, Alaeddine
    Fournier, Sebastien
    Mahmoudi, Khaoula
    Espinasse, Bernard
    Faiz, Sami
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 1218 - 1226
  • [27] From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough
    Mars, Mourad
    APPLIED SCIENCES-BASEL, 2022, 12 (17):
  • [28] Pre-trained Affective Word Representations
    Chawla, Kushal
    Khosla, Sopan
    Chhaya, Niyati
    Jaidka, Kokil
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [29] An Enhanced Sentiment Analysis Framework Based on Pre-Trained Word Embedding
    Mohamed, Ensaf Hussein
    Moussa, Mohammed ElSaid
    Haggag, Mohamed Hassan
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2020, 19 (04)
  • [30] Classification of Respiration Sounds Using Deep Pre-trained Audio Embeddings
    Meza, Carlos A. Galindo
    del Hoyo Ontiveros, Juan A.
    Lopez-Meyer, Paulo
    2021 IEEE LATIN AMERICAN CONFERENCE ON COMPUTATIONAL INTELLIGENCE (LA-CCI), 2021,