Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text

Cited by: 0
Authors
Hussain, Nisar [1]
Qasim, Amna [1]
Mehak, Gull [1]
Kolesnikova, Olga [1]
Gelbukh, Alexander [1]
Sidorov, Grigori [1]
Affiliation
[1] Inst Politecn Nacl IPN, Ctr Invest Comp CIC, Av Juan de Dios Batiz S-N, Mexico City 07320, Mexico
Keywords
deep learning; machine learning; support vector machine;
DOI
10.3390/ai6020033
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. Because Roman Urdu is transliterated, it poses specific challenges for computational linguistics: non-standardized grammar, multiple spellings of the same word, and heavy code-mixing with English. Together these make automated insult detection in Roman Urdu a highly complex problem. To address them, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first Roman Urdu insult-detection dataset annotated with insulting and non-insulting labels. The study applies preprocessing (text cleaning, text normalization, and tokenization) and extracts TF-IDF features over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning architectures (CNN, LSTM, and Bi-LSTM). Among the models compared, ensemble models gave the highest F1-scores, with 97.79%, 97.78%, and 95.25% reported across the AdaBoost, decision tree, TF-IDF, and Uni+Bi+Trigram configurations. Deep learning models performed comparably, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for further research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms.
This study has practical implications, mainly for building automated moderation tools that make online spaces safer, especially on South Asian social media websites.
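The feature extraction step described in the abstract (tokenization followed by TF-IDF over the union of unigrams, bigrams, and trigrams) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `tokenize`, `ngrams`, and `tfidf`, the cleaning rules, the smoothed IDF formula, and the two toy comments are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed cleaning: lowercase, drop URLs and non-letters, split on whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams for n in [n_min, n_max] (the Uni+Bi+Tri union)."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs, n_min=1, n_max=3):
    """Compute TF-IDF weights per document over n-gram features.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, which is one common variant
    (an assumption, not necessarily the one used in the paper).
    """
    doc_grams = [Counter(ngrams(tokenize(d), n_min, n_max)) for d in docs]
    df = Counter()
    for counts in doc_grams:
        df.update(counts.keys())  # each n-gram counted once per document
    n_docs = len(docs)
    weights = []
    for counts in doc_grams:
        total = sum(counts.values()) or 1
        weights.append({
            g: (c / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
            for g, c in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Hypothetical toy comments, not taken from the paper's dataset.
    comments = ["tum bohat achay ho", "tum bohat bure ho"]
    for comment, vec in zip(comments, tfidf(comments)):
        top = sorted(vec, key=vec.get, reverse=True)[:3]
        print(comment, "->", top)
```

N-grams that appear in only one comment (such as "bohat bure") receive a higher weight than those shared by both, which is the property that makes such features useful as classifier input. In practice the resulting sparse vectors would be passed to a classifier such as those listed in the abstract.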
Pages: 16