Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text

Cited by: 0

Authors:

Hussain, Nisar [1]

Qasim, Amna [1]

Mehak, Gull [1]

Kolesnikova, Olga [1]

Gelbukh, Alexander [1]

Sidorov, Grigori [1]

Affiliations:

[1] Inst Politecn Nacl IPN, Ctr Invest Comp CIC, Av Juan de Dios Batiz S-N, Mexico City 07320, Mexico

Source:

AI | 2025 / Vol. 6 / Issue 02

Keywords:

deep learning; machine learning; support vector machine;

DOI:

10.3390/ai6020033

CLC number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. Because Roman Urdu is transliterated, it poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variant spellings of the same word, and heavy code-mixing with English, which together make automated insult detection a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first Roman Urdu insult detection dataset to be created and annotated with insulting and non-insulting content. The study applies preprocessing methods such as text cleaning, text normalization, and tokenization, and extracts TF-IDF features over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning architectures (CNN, LSTM, and Bi-LSTM). Among the models compared, ensemble methods gave the highest F1-scores, reaching 97.79%, 97.78%, and 95.25%, respectively, for AdaBoost, decision tree, TF-IDF, and Uni+Bi+Trigram configurations. Deep learning models performed comparably, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for further research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms.
This study has practical implications, mainly for building automated moderation tools that make online spaces safer, especially on South Asian social media websites.
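
The feature extraction step described in the abstract (TF-IDF over the union of unigrams, bigrams, and trigrams) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `ngrams` and `tfidf_features`, the lowercase-and-split tokenization, the smoothed IDF formula, and the two toy comments are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_features(docs, n_min=1, n_max=3):
    """Compute TF-IDF weights over word n-grams for a list of raw strings."""
    # Assumption: lowercasing and whitespace splitting stand in for the
    # cleaning, normalization, and tokenization steps named in the abstract.
    grams_per_doc = [
        Counter(ngrams(doc.lower().split(), n_min, n_max)) for doc in docs
    ]
    n_docs = len(docs)

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())

    features = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        features.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            g: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, count in grams.items()
        })
    return features


# Hypothetical toy comments, not taken from the paper's dataset.
toy_docs = ["tum bohat achay ho", "tum bohat buray ho"]
toy_features = tfidf_features(toy_docs)
```

In this sketch, n-grams shared by both toy comments (such as "tum bohat") receive a lower weight than n-grams that appear in only one of them (such as "achay"), which is the property that lets a downstream classifier focus on distinguishing terms.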

Pages: 16
