Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text

Authors
Hussain, Nisar [1]
Qasim, Amna [1]
Mehak, Gull [1]
Kolesnikova, Olga [1]
Gelbukh, Alexander [1]
Sidorov, Grigori [1]
Affiliation
[1] Inst Politecn Nacl IPN, Ctr Invest Comp CIC, Av Juan de Dios Batiz S-N, Mexico City 07320, Mexico
Keywords
deep learning; machine learning; support vector machine
DOI
10.3390/ai6020033
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. Because Roman Urdu is transliterated, it poses specific challenges for computational linguistics: non-standardized grammar, variant spellings of the same word, and heavy code-mixing with English. Together these make automated insult detection in Roman Urdu a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first Roman Urdu insult detection dataset annotated for insulting and non-insulting content. The study applies preprocessing steps such as text cleaning, text normalization, and tokenization, followed by TF-IDF feature extraction over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning architectures (CNN, LSTM, and Bi-LSTM). In this comparison, ensemble models gave the highest F1-scores, reaching 97.79%, 97.78%, and 95.25%, respectively, for AdaBoost, decision tree, TF-IDF, and Uni+Bi+Trigram configurations. Deep learning models performed comparably, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for further research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms.
The study has practical implications, mainly for building automated moderation tools that make online spaces safer, especially on South Asian social media websites.
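The feature extraction step described in the abstract (cleaning, tokenization, then TF-IDF over unigrams, bigrams, and trigrams) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the cleaning rules, the smoothed IDF formula, the function names, and the sample comments are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, drop URLs and non-letter characters, split on whitespace.

    These cleaning rules are assumed for illustration only.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams for n in [n_min, n_max], joined by spaces."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def tfidf(docs):
    """Map each document to a {ngram: weight} dict of TF-IDF weights."""
    doc_feats = [Counter(ngrams(tokenize(d))) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for feats in doc_feats:
        df.update(feats.keys())

    vectors = []
    for feats in doc_feats:
        total = sum(feats.values())
        vec = {}
        for term, count in feats.items():
            # Smoothed IDF (an assumed, commonly used variant).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            vec[term] = (count / total) * idf
        vectors.append(vec)
    return vectors


# Made-up Roman Urdu comments, used only to exercise the functions.
comments = [
    "yeh film bohat achi hai",
    "yeh gana bilkul acha nahi hai http://example.com 123",
]
vectors = tfidf(comments)
```

The resulting sparse vectors could then be passed to any of the classifiers named in the abstract. N-grams shared by every comment (here "yeh" and "hai") receive a lower weight than n-grams that appear in only one comment.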
Pages: 16