Variance-based features for keyword extraction in Persian and English text documents

Cited by: 0
Authors
Veisi H. [1]
Aflaki N. [2, 3]
Parsafard P. [2]
Affiliations
[1] Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran
[2] Kish International Campus, University of Tehran, Kish
[3] Geoinformatics Collaboratory, School of Natural and Computational Sciences, Massey University, Auckland
Keywords
Clustering; Extraction; Persian text processing; Term frequency; Variance
DOI
10.24200/SCI.2019.50426.1685
Abstract
This paper addresses automatic keyword extraction from Persian and English text documents. Keyword extraction generally assigns a weight to each token and selects the words with the highest weights as keywords. This study proposes four word-weighting methods and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed methods are built on variance features: the variance-to-TF-IDF ratio, the variance-to-TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the clusters relative to the predefined document classes is used as the evaluation metric. A set of Persian documents was collected and labeled for these experiments. The results show that the proposed variance-to-TF ratio performed best for Persian texts, while the variance-to-TF-IDF ratio achieved the best entropy for English texts. © 2020 Sharif University of Technology. All rights reserved.
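The weighting and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so every definition here is an assumption: TF is taken as a term's mean relative frequency across documents, variance as the population variance of that frequency across documents, IDF as log(N / df), and cluster entropy as the size-weighted average class entropy within each cluster. The function names `term_weights`, `top_keywords`, and `cluster_entropy` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def term_weights(docs):
    """Compute per-term weights for a list of tokenized, non-empty documents.

    Assumed definitions (not taken from the paper):
      tf        = mean relative frequency of the term across documents
      variance  = population variance of that relative frequency
      idf       = log(N / number of documents containing the term)
      var_to_tf = variance / tf
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    weights = {}
    for term in vocab:
        # Relative frequency of the term in each document.
        freqs = [c[term] / len(doc) for c, doc in zip(counts, docs)]
        mean_tf = sum(freqs) / n_docs
        variance = sum((f - mean_tf) ** 2 for f in freqs) / n_docs
        doc_freq = sum(1 for f in freqs if f > 0)
        weights[term] = {
            "tf": mean_tf,
            "idf": math.log(n_docs / doc_freq),
            "variance": variance,
            # mean_tf > 0 because every vocabulary term occurs at least once.
            "var_to_tf": variance / mean_tf,
        }
    return weights


def top_keywords(weights, key, k):
    """Return the k terms with the highest weight under `key` (ties: A-Z)."""
    ranked = sorted(weights, key=lambda t: (-weights[t][key], t))
    return ranked[:k]


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy (bits) of class labels per cluster.

    Lower is better; 0 means every cluster holds a single class.
    """
    n = len(cluster_ids)
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    total = 0.0
    for label_counts in by_cluster.values():
        size = sum(label_counts.values())
        h = -sum((m / size) * math.log2(m / size)
                 for m in label_counts.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    docs = [["cat", "sat", "cat"], ["dog", "sat", "dog"], ["cat", "dog", "sat"]]
    w = term_weights(docs)
    print(top_keywords(w, "var_to_tf", 2))
    print(cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
```

Under these assumed definitions, a term that occurs at the same rate in every document has zero variance and ranks last, which is the intuition behind using variance as a discriminative signal. The intersection-based methods and the DFS and LNU baselines are not sketched here because the abstract does not specify them.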
Pages: 1301-1315
Page count: 14