Variance-based features for keyword extraction in Persian and English text documents

Cited by: 0
Authors
Veisi H. [1]
Aflaki N. [2, 3]
Parsafard P. [2]
Affiliations
[1] Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran
[2] Kish International Campus, University of Tehran, Kish
[3] Geoinformatics Collaboratory, School of Natural and Computational Sciences, Massey University, Auckland
Keywords
Clustering; Extraction; Persian text processing; Term frequency; Variance
DOI
10.24200/SCI.2019.50426.1685
Abstract
This paper addresses automatic keyword extraction in Persian and English text documents. In general, keywords are extracted from a text by assigning a weight to each token and selecting the words with the highest weights. This study proposes four methods for weighting words and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed weighting methods are built on variance features: the variance to TF-IDF ratio, the variance to TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the resulting clusters with respect to the predefined classes of the documents is used as the evaluation metric. A set of Persian documents was collected and labeled for these evaluations. Among the compared methods, the proposed variance to TF ratio performed best on Persian texts, while the variance to TF-IDF ratio achieved the best entropy on English texts. © 2020 Sharif University of Technology. All rights reserved.
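The abstract names the weighting schemes and the evaluation metric but does not give their formulas. The following Python sketch is one plausible reading, not the authors' implementation: it assumes TF is the mean relative frequency of a term across documents, variance is the population variance of those per-document frequencies, IDF is log(N / df), and cluster entropy is the size-weighted average of the class-label entropy inside each cluster. The function names `term_weights` and `cluster_entropy` are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs):
    """Compute illustrative weights for every term in a tokenized corpus.

    docs: list of non-empty token lists. All formulas are assumptions.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    weights = {}
    for term in vocab:
        # Relative frequency of the term in each document.
        freqs = [c[term] / len(doc) for c, doc in zip(counts, docs)]
        tf = sum(freqs) / n_docs  # assumed: mean relative frequency
        df = sum(1 for c in counts if c[term] > 0)
        idf = math.log(n_docs / df)  # assumed: unsmoothed IDF
        variance = sum((f - tf) ** 2 for f in freqs) / n_docs
        tfidf = tf * idf
        weights[term] = {
            "tf": tf,
            "idf": idf,
            "variance": variance,
            "var_tf": variance / tf,
            # A term present in every document has IDF 0; fall back to 0.
            "var_tfidf": variance / tfidf if tfidf > 0 else 0.0,
        }
    return weights


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy (in bits) of class labels per cluster."""
    n = len(cluster_ids)
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, []).append(label)
    total = 0.0
    for labels in by_cluster.values():
        size = len(labels)
        h = -sum((k / size) * math.log2(k / size)
                 for k in Counter(labels).values())
        total += (size / n) * h
    return total


docs = [["a", "b", "a"], ["b", "c"], ["b", "b"]]
weights = term_weights(docs)
# Rank terms by the variance to TF ratio, highest first.
ranked = sorted(weights, key=lambda t: weights[t]["var_tf"], reverse=True)
```

Under these assumptions, a term concentrated in a few documents gets a high variance to TF ratio, and lower cluster entropy means the clusters agree more closely with the predefined classes.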
Pages: 1301-1315
Page count: 14
Source: Scientia Iranica, 2020, 27 (03): 1301-1315