A document representation framework with interpretable features using pre-trained word embeddings

Authors
Narendra Babu Unnam
P. Krishna Reddy
Affiliation
[1] IIIT Hyderabad, Kohli Centre on Intelligent Systems
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects or the broad context of the document. Because of this low representational power, existing approaches also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose a document representation framework that captures multiple aspects of the document with interpretable features. The framework represents a document in a different feature space, in which each dimension corresponds to a potential feature word with relatively high discriminating power, and models a given document as the distances between the feature words and the document. To this end, we propose two criteria for selecting potential feature words and a distance function that measures the distance between a feature word and the document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and leaves scope for new approaches that improve document representation and its applications.
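The core idea of the abstract, one dimension per feature word with the value given by a word-to-document distance, can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' method: the abstract does not specify the distance function or the two selection criteria, so the sketch takes the feature words as given, uses the minimum cosine distance between a feature word and any word in the document, and relies on toy three-dimensional vectors in place of real pre-trained embeddings. The names `EMBEDDINGS`, `FEATURE_WORDS`, `word_to_document_distance`, and `represent` are hypothetical.

```python
import numpy as np

# Toy 3-dimensional vectors (hypothetical values) standing in for
# pre-trained word embeddings.
EMBEDDINGS = {
    "goal":     np.array([0.9, 0.1, 0.0]),
    "match":    np.array([0.8, 0.2, 0.1]),
    "election": np.array([0.1, 0.9, 0.1]),
    "vote":     np.array([0.0, 0.8, 0.2]),
    "sports":   np.array([1.0, 0.0, 0.0]),
    "politics": np.array([0.0, 1.0, 0.0]),
}

# Assumed to be already selected; the paper's selection criteria are not
# described in the abstract and are not reproduced here.
FEATURE_WORDS = ["sports", "politics"]


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_to_document_distance(feature_word, tokens):
    """Assumed distance: from the feature word to its nearest document word.

    Tokens without an embedding are skipped; a document with no known
    tokens gets the fallback distance 1.0.
    """
    feature_vector = EMBEDDINGS[feature_word]
    distances = [
        cosine_distance(feature_vector, EMBEDDINGS[token])
        for token in tokens
        if token in EMBEDDINGS
    ]
    return min(distances) if distances else 1.0


def represent(tokens, feature_words):
    """Build a document vector with one interpretable dimension per feature word."""
    return np.array([word_to_document_distance(w, tokens) for w in feature_words])


doc_vector = represent(["goal", "match"], FEATURE_WORDS)
print(dict(zip(FEATURE_WORDS, doc_vector.round(3))))
```

In this toy setup the document is closer to "sports" than to "politics", and each dimension can be read directly as the distance to a named word, which is the sense in which such features are interpretable.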
Pages: 49-64
Number of pages: 15
Source
International Journal of Data Science and Analytics, 2020, 10(1): 49-64