DEVELOPMENT OF COMPUTATIONAL LINGUISTIC RESOURCES FOR AUTOMATED DETECTION OF TEXTUAL CYBERBULLYING THREATS IN ROMAN URDU LANGUAGE

Cited by: 17

Authors:

Dewani, Amirita [1]

Memon, Mohsin Ali [1]

Bhatti, Sania [1]

Affiliations:

[1] Mehran Univ Engn & Technol, Jamshoro, Sindh, Pakistan

Source:

3C TIC | 2021 / Vol. 10 / No. 02

Keywords:

Linguistic Resources; Cyberaggression; Cyberbullying; Hate Speech Detection; Abusive Language Automated Detection;

DOI:

10.17993/3ctic.2021.102.101-121

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Discipline classification code:

081202;

Abstract:

Automatic cyberbullying detection remains a very challenging task because social media content and conversations are usually posted as unstructured free text that disregards language norms. A major gap in formulating cyberbullying detection strategies is the scarcity of linguistic resources, particularly for newly evolved languages. Roman Urdu has emerged recently and is therefore a resource-poor language. Urdu is widely known as the national language of Pakistan; however, because of socio-cultural and multilingual factors, Roman Urdu is used widely on the Internet by Asians, and more specifically by Pakistanis. To fill this gap, this research work presents guidelines for the data annotation process and develops two linguistic resources. (i) An annotated corpus in the Roman Urdu language for cyberaggression and offensive language detection. The data annotation process involved bilingual annotators instead of crowdsourcing, which has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The developed corpus is highly balanced (with almost negligible skew), unlike most existing corpora, even those in mature languages. (ii) A domain-specific stop-word list for the Roman Urdu language. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase: stop words carry the least semantic information and increase the feature space compared with the other tokens and index terms in corpora. The list was developed considering all lexical variants, specifically in the context of aggression detection and the collected data. The work was carried out using the Python programming language and the PyCharm IDE.
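The stop-word elimination step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word set, the tokenizer pattern, and the function names are assumptions made for the example, and the handful of Roman Urdu function words (with spelling variants such as "hai"/"hy" and "se"/"say") is a stand-in for the domain-specific list the paper develops.

```python
import re

# Hypothetical stop-word set: a few Roman Urdu function words together with
# common spelling variants. This is NOT the list published by the authors.
STOP_WORDS = {
    "hai", "hy", "hain",   # variants of "is/are"
    "ka", "ki", "ke",      # possessive markers
    "ko", "aur",           # "to", "and"
    "se", "say",           # variants of "from"
    "mein", "mai",         # variants of "in"
}

# Assumed tokenizer: lowercase the text and keep runs of ASCII letters only.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split Roman Urdu text into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [token for token in tokenize(text) if token not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("Yeh kitab Ali ki hai aur woh ghar mein hy"))
```

Listing every spelling variant explicitly keeps the lookup a plain set-membership test; a normalization step that maps variants to one canonical form would be an alternative design, but the abstract does not say which approach the authors took.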

Citation

Pages: 101-121

Page count: 21
