A document representation framework with interpretable features using pre-trained word embeddings

Cited by: 1
Authors
Unnam, Narendra Babu [1]
Reddy, P. Krishna [1]
Affiliations
[1] IIIT Hyderabad, Kohli Ctr Intelligent Syst, Hyderabad, India
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings
DOI
10.1007/s41060-019-00200-5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an improved framework for document representation using word embeddings. Existing models represent a document as a position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects or the broad context of the document. Because of this low representational power, existing approaches also perform poorly at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose a document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space, in which each dimension corresponds to a potential feature word with relatively high discriminating power. A given document is modeled as the vector of distances between these feature words and the document. To build this representation, we propose two criteria for selecting potential feature words and a distance function that measures the distance between a feature word and the document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and leaves scope for new approaches that improve document representation and its applications.
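The core idea in the abstract, one dimension per feature word with the document modeled as its distances to those words, can be sketched as follows. This is an illustrative sketch only, not the authors' method: the abstract does not specify the distance function or the selection criteria, so the minimum cosine distance to any document word, the hand-picked `FEATURE_WORDS`, the toy `EMBEDDINGS` values, and the fallback distance of 1.0 are all assumptions.

```python
import numpy as np

# Toy 3-dimensional stand-ins for pre-trained word embeddings.
# The values are hypothetical and chosen only for illustration.
EMBEDDINGS = {
    "goal": np.array([1.0, 0.0, 0.0]),
    "match": np.array([0.9, 0.1, 0.0]),
    "vote": np.array([0.0, 1.0, 0.0]),
    "election": np.array([0.1, 0.9, 0.0]),
    "stock": np.array([0.0, 0.0, 1.0]),
}

# Hand-picked feature words (assumption). The paper selects them with
# two criteria that the abstract does not describe.
FEATURE_WORDS = ["goal", "vote", "stock"]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_document_distance(feature_word, doc_tokens, embeddings=EMBEDDINGS):
    """Assumed distance: cosine distance from the feature word to the
    nearest in-vocabulary word of the document."""
    vectors = [embeddings[t] for t in doc_tokens if t in embeddings]
    if not vectors:
        return 1.0  # assumed fallback when no document word has an embedding
    f = embeddings[feature_word]
    return min(cosine_distance(f, v) for v in vectors)


def represent(doc_tokens, feature_words=FEATURE_WORDS, embeddings=EMBEDDINGS):
    """Map a tokenized document to one distance per feature word."""
    return np.array(
        [word_document_distance(w, doc_tokens, embeddings) for w in feature_words]
    )


doc_vector = represent(["match", "goal"])
for word, dist in zip(FEATURE_WORDS, doc_vector):
    print(f"{word}: {dist:.3f}")
```

Each entry of `doc_vector` can be read directly: a small value means the document contains a word close to that feature word. A vector of this kind could then be passed to any standard classifier.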
Pages: 49-64
Page count: 16
Related papers
50 in total
  • [31] Radiological Report Generation from Chest X-ray Images Using Pre-trained Word Embeddings
    Alotaibi, Fahd Saleh
    Kaur, Navdeep
    WIRELESS PERSONAL COMMUNICATIONS, 2023, 133 (04) : 2525 - 2540
  • [33] Evaluation Metrics for Headline Generation Using Deep Pre-Trained Embeddings
    Moeed, Abdul
    An, Yang
    Hagerer, Gerhard
    Groh, Georg
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 1796 - 1802
  • [34] Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention
    Li, Yanzeng
    Yu, Bowen
    Xue, Mengge
    Liu, Tingwen
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3442 - 3448
  • [35] Handwritten Document Recognition Using Pre-trained Vision Transformers
    Parres, Daniel
    Anitei, Dan
    Paredes, Roberto
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT II, 2024, 14805 : 173 - 190
  • [36] Improving document representation using KPCA and clustered word embeddings
    Gupta, Aakansha
    Katarya, Rahul
    2021 5TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS, COMMUNICATION, COMPUTER TECHNOLOGIES AND OPTIMIZATION TECHNIQUES (ICEECCOT), 2021, : 514 - 517
  • [37] Pre-trained Word Embeddings for Arabic Aspect-Based Sentiment Analysis of Airline Tweets
    Ashi, Mohammed Matuq
    Siddiqui, Muazzam Ahmed
    Nadeem, Farrukh
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2018, 2019, 845 : 241 - 251
  • [38] On the Role of Pre-trained Embeddings in Binary Code Analysis
    Maier, Alwin
    Weissberg, Felix
    Rieck, Konrad
    PROCEEDINGS OF THE 19TH ACM ASIA CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, ACM ASIACCS 2024, 2024, : 795 - 810
  • [39] Pre-trained Embeddings for Entity Resolution: An Experimental Analysis
    Zeakis, Alexandros
    Papadakis, George
    Skoutas, Dimitrios
    Koubarakis, Manolis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (09): : 2225 - 2238
  • [40] On the Sentence Embeddings from Pre-trained Language Models
    Li, Bohan
    Zhou, Hao
    He, Junxian
    Wang, Mingxuan
    Yang, Yiming
    Li, Lei
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 9119 - 9130