Level statistics of words: Finding keywords in literary texts and symbolic sequences

Cited by: 53
Authors
Carpena, P. [1 ]
Bernaola-Galvan, P. [1 ]
Hackenberg, M. [2 ]
Coronado, A. V. [1 ]
Oliver, J. L. [3 ]
Affiliations
[1] Univ Malaga, Dept Fis Aplicada 2, E-29071 Malaga, Spain
[2] CIC BioGUNE, Bioinformat Grp, Derio 48160, Bizkaia, Spain
[3] Univ Granada, Dept Genet, E-18071 Granada, Spain
Source
PHYSICAL REVIEW E | 2009 / Vol. 79 / Issue 03
Keywords
statistical analysis; text analysis; OCCURRENCES; DISTANCES; MODELS;
DOI
10.1103/PhysRevE.79.035102
Chinese Library Classification number
O35 [Fluid mechanics]; O53 [Plasma physics];
Subject classification code
070204 ; 080103 ; 080704 ;
Abstract
Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach that automatically extracts keywords from literary texts. Our approach takes into account not only the frequencies of the words in the text but also their spatial distribution along it, and is based on the fact that relevant words are significantly clustered (i.e., they attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works on generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
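The abstract does not state the statistic used, so the following is only a minimal sketch of the idea under a stated assumption: clustering is measured here by the ratio of the standard deviation to the mean of the distances between successive occurrences of a word. Under that assumption, randomly placed words give a ratio near 1, clustered words give a larger value, and evenly spaced words give a smaller one. The function name `clustering_scores` and the `min_count` threshold are hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev


def clustering_scores(text, min_count=5):
    """Return {word: sigma}, where sigma = std(gaps) / mean(gaps).

    Assumption (not from the abstract): the gaps are the distances,
    counted in words, between successive occurrences of the same word.
    """
    # Crude tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())

    # Record the position of every occurrence of every word.
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    scores = {}
    for word, pos in positions.items():
        # Skip rare words: too few gaps for a meaningful estimate.
        if len(pos) < min_count:
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        scores[word] = pstdev(gaps) / mean(gaps)
    return scores
```

Sorting the returned dictionary by value in descending order gives a candidate keyword ranking. No reference corpus is involved, which matches the single-document setting described above, although a real implementation would also need to account for the dependence of the score on word frequency.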
Pages: 4