Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach

Cited by: 1
Authors
Voskergian, Daniel [1]
Jayousi, Rashid [2]
Yousef, Malik [3]
Affiliations
[1] Al Quds Univ, Comp Engn Dept, Jerusalem, Palestine
[2] Al Quds Univ, Comp Sci Dept, Jerusalem, Palestine
[3] Zefat Acad Coll, Dept Informat Syst, Safed, Israel
Source
SCIENTIFIC REPORTS | 2024, Vol. 14, Issue 01
Keywords
Topic model; Topic selection; Feature selection; Ensemble learning; Text classification; Machine learning
DOI
10.1038/s41598-024-74022-2
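Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract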
TextNetTopics (Yousef et al. in Front Genet 13:893378, 2022. https://doi.org/10.3389/fgene.2022.893378) is a recently developed approach that performs text classification using topics (a topic is a group of terms or words) extracted by Latent Dirichlet Allocation topic modeling as features, rather than individual words. This enables TextNetTopics to reduce dimensionality while preserving and embedding more thematic and semantic information in the text document representations. In this article, we introduce a novel approach, the Ensemble Topic Model for Topic Selection (ENTM-TS), an advancement of TextNetTopics. ENTM-TS integrates multiple topic models using the Grouping, Scoring, and Modeling approach, thereby mitigating the performance variability introduced by employing individual topic modeling methods within TextNetTopics. Additionally, we performed a thorough comparative study to evaluate TextNetTopics' performance using eleven state-of-the-art topic modeling algorithms. We used the topics extracted by each algorithm as input to the G component of the TextNetTopics tool in order to select the most effective topic model with respect to predictive behavior for text classification. We conducted our evaluation on the Drug-Induced Liver Injury textual dataset from the CAMDA community and the WOS-5736 dataset. The experimental results show that Latent Semantic Indexing provides comparable performance measures with fewer discriminative features than the other topic modeling methods. Moreover, our evaluation reveals that the performance of ENTM-TS surpasses or matches the best results obtained from individual topic models across the two datasets, establishing it as a robust and effective enhancement for text classification tasks.
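The idea of pooling topics from several topic models and ranking them by usefulness for classification can be illustrated with a small sketch. It is not the authors' implementation: the toy documents, the topic word sets, the names `topic_features`, `score_topic`, and `rank_topics`, and above all the scoring rule (absolute difference in mean topic-word count between the two classes) are assumptions made for illustration only. The abstract does not state how ENTM-TS scores a topic.

```python
from collections import Counter


def topic_features(tokens, topics):
    """Represent one document as a vector of per-topic word counts."""
    counts = Counter(tokens)
    return [sum(counts[w] for w in words) for words in topics.values()]


def score_topic(docs, labels, words):
    """Toy score (an assumption, not the paper's scoring step):
    absolute difference in mean topic-word count between class 1 and class 0."""
    per_class = {0: [], 1: []}
    for tokens, label in zip(docs, labels):
        counts = Counter(tokens)
        per_class[label].append(sum(counts[w] for w in words))
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return abs(mean(per_class[1]) - mean(per_class[0]))


def rank_topics(docs, labels, topics):
    """Return (topic name, score) pairs sorted from most to least discriminative."""
    scored = [(name, score_topic(docs, labels, words)) for name, words in topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical tokenized documents and binary labels.
docs = [
    ["liver", "injury", "drug", "toxicity"],
    ["liver", "enzyme", "elevated", "drug"],
    ["gene", "expression", "cell", "protein"],
    ["protein", "cell", "binding", "gene"],
]
labels = [1, 1, 0, 0]

# Hypothetical topics (groups of words) from two different topic models.
topics_model_a = {"a0": {"liver", "injury", "toxicity"}, "a1": {"study", "patients"}}
topics_model_b = {"b0": {"gene", "cell", "protein"}, "b1": {"drug", "study"}}

# Ensemble step: pool the topics of all models into one candidate set.
pooled = {**topics_model_a, **topics_model_b}

ranking = rank_topics(docs, labels, pooled)
top_topics = [name for name, _ in ranking[:2]]
print(ranking)
print(top_topics)
```

In a fuller pipeline, the top-ranked topics would define the feature space passed to a classifier, and the scoring rule would typically be replaced by a cross-validated classifier score per topic.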
Pages: 19