Social Media Topic Classification on Greek Reddit

Authors
Mastrokostas, Charalampos [1]
Giarelis, Nikolaos [1]
Karacapilidis, Nikos [1]
Affiliations
[1] Univ Patras, Ind Management & Informat Syst Lab, MEAD, Rion 26504, Greece
Keywords
Greek language; deep learning; large language models; machine learning; natural language processing; transformers; text classification; Greek NLP resources; social media
DOI
10.3390/info15090521
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text classification (TC) is a subtask of natural language processing (NLP) that assigns pieces of text to predefined classes based on their textual content and thematic aspects. This process typically involves training a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models that surpass traditional ML ones. However, the topic classification literature prioritizes high-resource languages, particularly English, while research efforts for low-resource languages, such as Greek, are limited. With this in mind, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, using an array of text vectorization methods including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a GREEK-BERT-based TC model fine-tuned on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while the fine-tuned DL model outperforms previous ones. The dataset, the best-performing model, and the experimental code are made public to improve the reproducibility of this work and to advance future research in the field.
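As a rough illustration of the traditional pipeline the abstract describes (vectorize labeled texts with TF-IDF, then fit a classifier), the following is a minimal sketch in plain Python. It is not the authors' code: the nearest-centroid classifier is a simple stand-in for the ML models they compare, and the Greek sentences, the label names (`sports`, `technology`), and all function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real Greek text would need proper preprocessing.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed over the training documents.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # L2-normalized sparse TF-IDF vector; terms unseen in training are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train_centroids(docs, labels, idf):
    # Average the TF-IDF vectors of each class into one centroid per label.
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(text, idf, centroids):
    # Pick the label whose centroid has the largest dot product with the text vector.
    vec = tfidf(text, idf)

    def score(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(sorted(centroids), key=lambda label: score(centroids[label]))


# Toy labeled dataset (invented for illustration only).
train_docs = [
    "η ομάδα κέρδισε τον αγώνα ποδοσφαίρου",
    "ο προπονητής άλλαξε την ομάδα πριν τον αγώνα",
    "νέος επεξεργαστής για φορητό υπολογιστή",
    "ο υπολογιστής χρειάζεται νέα κάρτα γραφικών",
]
train_labels = ["sports", "sports", "technology", "technology"]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
prediction = predict("ποιος κέρδισε τον αγώνα", idf, centroids)
print(prediction)
```

In practice, a library vectorizer and a stronger classifier would replace these hand-written pieces; the sketch only shows how labeled texts become vectors and how a class is chosen for a new text.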
Pages: 18