A probabilistic topic model based on short distance Co-occurrences

Cited by: 7

Authors:

Rahimi, Marziea [1]

Zahedi, Morteza [1]

Mashayekhi, Hoda [1]

Affiliations:

[1] Shahrood Univ Technol, Fac Comp Engn, Shahrood 3619995161, Iran

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2022 / Vol. 193

Keywords:

Probabilistic topic model; Latent Dirichlet Allocation; Document clustering; Context window; Local co-occurrence; Word order; NOISY TEXT; DISCOVERY; CLASSIFICATION;

DOI:

10.1016/j.eswa.2022.116518

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A limitation of many probabilistic topic models such as Latent Dirichlet Allocation (LDA) is their inability to make flexible use of local context. As a result, these models cannot directly benefit from short-distance co-occurrences, which are more likely to indicate meaningful word relationships. Some models, such as the Bigram Topic Model (BTM), consider local context by integrating language and topic models. However, because they take the exact word order into account, such models suffer severely from sparseness. Other models, such as Latent Dirichlet Co-Clustering (LDCC), try to solve the problem by adding another level of granularity, treating a document as a bag of segments while ignoring the word order. In this paper, we introduce a new topic model which uses overlapping windows to encode local word relationships. In the proposed model, we assume a document is composed of fixed-size overlapping windows, and formulate a new generative process accordingly. In the inference procedure, each word is sampled once, in only a single window, while influencing the sampling of the words that co-occur with it in other windows. Word relationships are discovered at the document level, but the topic of each word is derived considering only its neighboring words in a window, to emphasize local word relationships. By using overlapping windows, without assuming an explicit dependency between adjacent words, we avoid ignoring the word order completely. The proposed model is straightforward and not severely prone to sparseness, and, as the experimental results show, it produces more meaningful and more coherent topics than the three established models mentioned above.
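
The idea of viewing a document as fixed-size overlapping windows can be illustrated with a minimal Python sketch. This is not the authors' generative process or inference procedure, which the abstract does not specify in detail; the function names, the window size of 3, and the step of 1 are assumptions chosen only for illustration. The sketch splits a token sequence into overlapping windows and counts, for each unordered word pair, the number of windows in which both words appear, so that only short-distance co-occurrences are recorded.

```python
from collections import Counter
from itertools import combinations


def overlapping_windows(tokens, size=3, step=1):
    """Split tokens into fixed-size windows; windows overlap when step < size.

    Hypothetical helper for illustration. If (len(tokens) - size) is not a
    multiple of step, trailing tokens that do not fill a window are dropped.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    if len(tokens) <= size:
        # A short document forms a single window.
        return [list(tokens)]
    return [list(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, step)]


def window_cooccurrences(tokens, size=3, step=1):
    """Count, per unordered word pair, the windows in which both words occur."""
    counts = Counter()
    for window in overlapping_windows(tokens, size, step):
        # Distinct words in the window, sorted so each pair has one canonical key.
        for pair in combinations(sorted(set(window)), 2):
            counts[pair] += 1
    return counts


tokens = ["topic", "models", "use", "local", "context"]
windows = overlapping_windows(tokens)
counts = window_cooccurrences(tokens)
```

In this toy example, words that are far apart, such as "topic" and "context", never share a window and so receive no count, while close neighbors such as "models" and "use" share more than one window. Pairs are stored without regard to order, which mirrors the abstract's point that no explicit dependency between adjacent words is assumed.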


Pages: 14
