Variance-based features for keyword extraction in Persian and English text documents

Authors
Veisi H. [1]
Aflaki N. [2,3]
Parsafard P. [2]
Affiliations
[1] Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran
[2] Kish International Campus, University of Tehran, Kish
[3] Geoinformatics Collaboratory, School of Natural and Computational Sciences, Massey University, Auckland
Keywords
Clustering; Extraction; Persian text processing; Term frequency; Variance
DOI
10.24200/SCI.2019.50426.1685
Abstract
This paper addresses automatic keyword extraction from Persian and English text documents. Keyword extraction typically assigns a weight to each token and selects the words with the highest weights as keywords. This study proposes four word-weighting methods and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed methods are built on variance features: the variance-to-TF-IDF ratio, the variance-to-TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the clusters relative to the predefined classes of the documents is used as the evaluation metric. A set of Persian documents was collected and labeled for these evaluations. The proposed variance-to-TF ratio performed best on Persian texts, and the variance-to-TF-IDF ratio achieved the best entropy on English texts. © 2020 Sharif University of Technology. All rights reserved.
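The abstract names the variance-to-TF ratio and cluster entropy but does not define them, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a term's TF is its mean relative frequency across documents, that its variance is the population variance of that relative frequency across documents, and that cluster entropy is the size-weighted entropy of the class labels inside each cluster. The function names `variance_to_tf_weights`, `top_keywords`, and `cluster_entropy` are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import pvariance


def variance_to_tf_weights(docs):
    """Weight each term by variance / TF (both definitions are assumptions)."""
    # Whitespace tokenization is a simplification; empty documents are skipped.
    tokenized = [d.lower().split() for d in docs if d.split()]
    vocab = sorted({t for toks in tokenized for t in toks})
    weights = {}
    for term in vocab:
        # Relative frequency of the term in every document.
        rel = [toks.count(term) / len(toks) for toks in tokenized]
        tf = sum(rel) / len(rel)          # assumed TF: mean relative frequency
        weights[term] = pvariance(rel) / tf
    return weights


def top_keywords(docs, k):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    weights = variance_to_tf_weights(docs)
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted entropy of class labels within each cluster (lower is better)."""
    n = len(cluster_ids)
    members = {}
    for c, y in zip(cluster_ids, class_labels):
        members.setdefault(c, []).append(y)
    total = 0.0
    for labels in members.values():
        counts = Counter(labels)
        h = -sum((m / len(labels)) * log2(m / len(labels)) for m in counts.values())
        total += len(labels) / n * h
    return total
```

Under these assumptions, a term that occurs equally often in every document gets weight 0, while a term concentrated in a few documents gets a higher weight. A clustering that matches the predefined classes exactly yields an entropy of 0.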
Pages: 1301-1315
Number of pages: 14