A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 0

Authors:

Narendra Babu Unnam

P. Krishna Reddy

Affiliation:

[1] IIIT Hyderabad, Kohli Centre on Intelligent Systems

Source:

International Journal of Data Science and Analytics | 2020 / Vol. 10

Keywords:

Text mining; Feature engineering; Document representation; Document classification; Word embeddings;

DOI:

Not available


Abstract:

We propose an improved framework for document representation using word embeddings. Existing models represent a document as a position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects or the broad context of the document. Also, because of their low representational power, existing approaches perform poorly at document classification. Furthermore, the document vectors obtained with such methods have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space, in which each dimension corresponds to a potential feature word with relatively high discriminating power. A given document is modeled as the distances between the feature words and the document. To represent a document, we propose two criteria for selecting potential feature words and a distance function that measures the distance between a feature word and the document. Experimental results on multiple datasets show that the proposed model consistently performs better at document classification than the baseline methods. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and leaves scope for new approaches that improve document representation and its applications.
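The core idea in the abstract, representing a document by its distances to a set of feature words, can be illustrated with a minimal sketch. The abstract does not specify the two selection criteria or the distance function, so everything concrete here is an assumption made for illustration: the toy embedding values, the hand-picked `FEATURE_WORDS`, and the hypothetical `word_to_document_distance`, which takes the minimum cosine distance between a feature word and any word in the document. It is not the authors' method.

```python
import math

# Toy 3-dimensional embeddings with made-up values (assumption, for illustration only).
EMBEDDINGS = {
    "goal":     [0.90, 0.10, 0.00],
    "match":    [0.80, 0.20, 0.10],
    "striker":  [0.85, 0.05, 0.10],
    "vote":     [0.10, 0.90, 0.00],
    "senate":   [0.00, 0.95, 0.10],
    "election": [0.05, 0.90, 0.20],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def word_to_document_distance(feature_word, document_tokens, embeddings):
    """Hypothetical distance: the smallest cosine distance between the
    feature word and any in-vocabulary word of the document."""
    feature_vec = embeddings[feature_word]
    distances = [
        cosine_distance(feature_vec, embeddings[token])
        for token in document_tokens
        if token in embeddings
    ]
    # Assumed fallback for documents with no known words: distance 1.0.
    return min(distances) if distances else 1.0


def represent_document(document_tokens, feature_words, embeddings):
    """One dimension per feature word; each value is a word-to-document distance."""
    return [
        word_to_document_distance(word, document_tokens, embeddings)
        for word in feature_words
    ]


# Hand-picked feature words (assumption; the paper selects them by its own criteria).
FEATURE_WORDS = ["goal", "vote"]

sports_vec = represent_document(["striker", "match"], FEATURE_WORDS, EMBEDDINGS)
politics_vec = represent_document(["senate", "election"], FEATURE_WORDS, EMBEDDINGS)

print(dict(zip(FEATURE_WORDS, sports_vec)))
print(dict(zip(FEATURE_WORDS, politics_vec)))
```

Each dimension of the resulting vector is tied to a named word, which is what makes the features readable: in this toy setup the sports document lies closer to "goal" than to "vote", and the politics document the reverse.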


Pages: 49-64

Page count: 15

Citation: Unnam, Narendra Babu; Reddy, P. Krishna. A document representation framework with interpretable features using pre-trained word embeddings. International Journal of Data Science and Analytics, 2020, 10 (01): 49-64.