Integrating Social and Auxiliary Semantics for Multifaceted Topic Modeling in Twitter

Cited by: 22
Authors
Vosecky, Jan [1]
Jiang, Di [1]
Leung, Kenneth Wai-Ting [1]
Xing, Kai [1]
Ng, Wilfred [1]
Affiliation
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Kowloon, Hong Kong, Peoples R China
Keywords
Algorithms; Experimentation; Social media; topic model; unsupervised learning; semantic enrichment;
DOI
10.1145/2651403
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Microblogging platforms, such as Twitter, have already played an important role in recent cultural, social and political events. Discovering latent topics from social streams is therefore important for many downstream applications, such as clustering, classification or recommendation. However, traditional topic models that rely on the bag-of-words assumption are insufficient to uncover the rich semantics and temporal aspects of topics in Twitter. In particular, microblog content is often influenced by external information sources, such as Web documents linked from Twitter posts, and often focuses on specific entities, such as people or organizations. These external sources provide useful semantics to understand microblogs and we generally refer to these semantics as auxiliary semantics. In this article, we address the mentioned issues and propose a unified framework for Multifaceted Topic Modeling from Twitter streams. We first extract social semantics from Twitter by modeling the social chatter associated with hashtags. We further extract terms and named entities from linked Web documents to serve as auxiliary semantics during topic modeling. The Multifaceted Topic Model (MfTM) is then proposed to jointly model latent semantics among the social terms from Twitter, auxiliary terms from the linked Web documents and named entities. Moreover, we capture the temporal characteristics of each topic. An efficient online inference method for MfTM is developed, which enables our model to be applied to large-scale and streaming data. Our experimental evaluation shows the effectiveness and efficiency of our model compared with state-of-the-art baselines. We evaluate each aspect of our framework and show its utility in the context of tweet clustering.
Pages: 24