Text classification;
Complaint reports;
Waste management;
Word embedding;
Language model;
fastText;
BERT;
DOI:
10.1007/978-3-031-06746-4_36
Chinese Library Classification (CLC) number:
TP18 [Artificial intelligence theory];
Discipline classification codes:
081104;
0812;
0835;
1405;
Abstract:
This paper addresses the automatic classification of complaint letters written in Polish and sent to the municipal waste management system of one of the largest Polish cities. The problem is a multi-class classification task with information source separation. The authors compare five approaches, ranging from TF-IDF, through word2vec methods, to transformer-based BERT models. The article includes a detailed analysis of the experiments performed and the data set used. The evaluation used stratified 10-fold cross-validation. Classification results were assessed with three measures: precision, average F1 score, and weighted F1 score. The results confirm that the BERT-based approach outperforms the other approaches, and the HerBERT large model is recommended for similar downstream tasks in Polish.
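
The evaluation protocol named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function names (`stratified_folds`, `per_class_f1`, `macro_f1`, `weighted_f1`) and the toy labels are hypothetical, and "average F1 score" is assumed here to mean the unweighted (macro) mean of per-class F1 scores. The fold assignment is a simple round-robin approximation of stratification without shuffling.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold has similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label present in y_true."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'average F1')."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy, made-up labels for illustration only (not data from the paper).
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
folds = stratified_folds(y_true, k=2)
print(folds)
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

On imbalanced data the two F1 variants diverge: the macro average treats every class equally, while the weighted average is dominated by the frequent classes, which is one reason to report both.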