Variance-based features for keyword extraction in Persian and English text documents

Cited by: 0

Authors:

Veisi H. [1]

Aflaki N. [2,3]

Parsafard P. [2]

Affiliations:

[1] Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran

[2] Kish International Campus, University of Tehran, Kish

[3] Geoinformatics Collaboratory, School of Natural and Computational Sciences, Massey University, Auckland

Source:

Scientia Iranica | 2020 / Vol. 27 / Issue 3 D

Keywords:

Clustering; Extraction; Persian text processing; Term frequency; Variance;

DOI:

10.24200/SCI.2019.50426.1685

Abstract:

This paper addresses automatic keyword extraction in Persian and English text documents. Generally, to extract keywords from a text, a weight is assigned to each token, and the words with the highest weights are selected as keywords. This study proposes four methods for weighting words and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed weighting methods are built on variance features: the variance to TF-IDF ratio, the variance to TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the clusters and the predefined classes of the documents are used as the evaluation metrics. For the evaluations, this study collected and labeled Persian documents. The results show that, among the proposed weighting methods, the variance to TF ratio performed best for Persian texts, while the variance to TF-IDF ratio achieved the best entropy for English texts. © 2020 Sharif University of Technology. All rights reserved.
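
As a rough illustration of the ideas in the abstract, the sketch below computes a variance to TF ratio per term and a size-weighted cluster entropy. The abstract does not give the paper's formulas, so everything here is an assumption: TF is taken as the total count of a term over the corpus, variance as the population variance of its per-document counts, and entropy as the usual class-distribution entropy averaged over clusters. The function names `term_weights`, `top_keywords`, and `cluster_entropy` are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def term_weights(docs):
    """Map each term to (tf, variance, variance / tf).

    Assumed definitions (not taken from the paper):
    tf is the total count over all documents, variance is the
    population variance of the per-document counts.
    """
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    weights = {}
    for term in vocab:
        per_doc = [c[term] for c in counts]  # a Counter yields 0 for absent terms
        tf = sum(per_doc)                    # always >= 1 for terms in vocab
        var = pvariance(per_doc)
        weights[term] = (tf, var, var / tf)
    return weights


def top_keywords(docs, k):
    """Return the k terms with the highest variance / tf ratio."""
    weights = term_weights(docs)
    # Sort by descending ratio, breaking ties alphabetically.
    return sorted(weights, key=lambda t: (-weights[t][2], t))[:k]


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy (in bits) of class labels per cluster."""
    n = len(cluster_ids)
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    total = 0.0
    for label_counts in by_cluster.values():
        size = sum(label_counts.values())
        h = -sum((c / size) * math.log2(c / size) for c in label_counts.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    # Toy tokenized corpus: "a" is concentrated in one document, "b" is spread evenly.
    docs = [["a", "a", "a", "a", "b"], ["b"], ["b"]]
    print(top_keywords(docs, 1))
    print(cluster_entropy([0, 0, 1, 1], ["x", "y", "x", "y"]))
```

Under these assumed definitions, a term spread evenly over all documents has zero variance and so a zero ratio, while a term concentrated in a few documents scores higher. Lower cluster entropy means the clusters agree more closely with the predefined classes.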

Pages: 1301-1315

Page count: 14
