Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text

Authors
Hussain, Nisar [1]
Qasim, Amna [1]
Mehak, Gull [1]
Kolesnikova, Olga [1]
Gelbukh, Alexander [1]
Sidorov, Grigori [1]
Affiliations
[1] Inst Politecn Nacl IPN, Ctr Invest Comp CIC, Av Juan de Dios Batiz S-N, Mexico City 07320, Mexico
Keywords
deep learning; machine learning; support vector machine
DOI
10.3390/ai6020033
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. The transliterated nature of Roman Urdu poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variant spellings of the same word, and heavy code-mixing with English, which together make automated insult detection in Roman Urdu a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first Roman Urdu insult detection dataset annotated with insulting and non-insulting content. The study applies preprocessing steps such as text cleaning, text normalization, and tokenization, followed by TF-IDF feature extraction over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning architectures (CNN, LSTM, and Bi-LSTM). Among the models compared, ensemble methods gave the highest F1-scores, reaching 97.79%, 97.78%, and 95.25% for configurations involving AdaBoost, decision tree, TF-IDF, and Uni+Bi+Trigram features. Deep learning models performed comparably, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for further research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms.
The study has implications mainly for the construction of automated moderation tools for safer online spaces, especially on South Asian social media websites.
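The feature extraction step mentioned in the abstract (TF-IDF over the union of unigrams, bigrams, and trigrams) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the helper names (`tokenize`, `ngrams`, `tfidf_vectors`), and the two sample comments are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a simplistic cleaner)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=3):
    """Compute one sparse TF-IDF dict per document.

    tf  = n-gram count in the document / total n-grams in the document
    idf = log((1 + N) / (1 + df)) + 1   (smoothed variant; an assumption)
    """
    gram_lists = [ngrams(tokenize(d), n_min, n_max) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for grams in gram_lists:
        df.update(set(grams))

    vectors = []
    for grams in gram_lists:
        counts = Counter(grams)
        total = len(grams) or 1  # avoid division by zero for empty documents
        vectors.append({
            g: (c / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, c in counts.items()
        })
    return vectors


# Hypothetical placeholder comments, not taken from the paper's dataset.
docs = ["yeh video bohat acha hai", "yeh banda bilkul bekar hai"]
vectors = tfidf_vectors(docs)
```

Under these assumptions, n-grams that occur in both comments (such as "yeh") receive a lower weight than n-grams unique to one comment (such as "video"). The resulting sparse vectors could then be passed to any of the classifiers named in the abstract.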
Pages: 16