DEVELOPMENT OF COMPUTATIONAL LINGUISTIC RESOURCES FOR AUTOMATED DETECTION OF TEXTUAL CYBERBULLYING THREATS IN ROMAN URDU LANGUAGE

Cited by: 17
Authors
Dewani, Amirita [1]
Memon, Mohsin Ali [1]
Bhatti, Sania [1]
Affiliations
[1] Mehran Univ Engn & Technol, Jamshoro, Sindh, Pakistan
Source
3C TIC | 2021 / Vol. 10 / Issue 02
Keywords
Linguistic Resources; Cyberaggression; Cyberbullying; Hate Speech Detection; Abusive Language Automated Detection;
DOI
10.17993/3ctic.2021.102.101-121
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Automatic cyberbullying detection remains a very challenging task because social media content and conversations are usually posted as unstructured free text that disregards language norms. The major gap in formulating cyberbullying detection strategies is the scarcity of linguistic resources, particularly for newly evolved languages. Roman Urdu has emerged only recently and is therefore a resource-poor language. Urdu is widely known as the national language of Pakistan; however, because of socio-cultural and multilingual factors, Roman Urdu is widely used on the Internet by Asians, and more specifically by Pakistanis. To fill this gap, this research work presents guidelines for the data annotation process and develops two linguistic resources. (i) An annotated corpus in Roman Urdu for cyberaggression and offensive language detection. The annotation was carried out by bilingual annotators rather than by crowdsourcing, which has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The resulting corpus is highly balanced (with almost negligible skew), unlike most existing corpora, even those in mature languages. (ii) A domain-specific stop-word list for Roman Urdu. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase: stop words carry the least semantic information and enlarge the feature space compared with the other tokens and index terms in a corpus. The list was developed to cover all lexical variants, specifically in the context of aggression detection and the collected data. The work was carried out using the Python programming language and the PyCharm IDE.
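The stop-word elimination step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop words and spelling variants in `STOP_WORD_VARIANTS` are a handful of common Roman Urdu function words chosen for the example and are not taken from the list developed in the paper, and the names `tokenize` and `remove_stop_words` are hypothetical.

```python
import re

# Illustrative only: a few common Roman Urdu function words with assumed
# spelling variants. This is NOT the stop-word list developed in the paper.
STOP_WORD_VARIANTS = {
    "hai": {"hai", "hay", "hy"},
    "mein": {"mein", "mei"},
    "ka": {"ka", "ki", "ke", "kay"},
    "ko": {"ko"},
    "se": {"se", "say"},
}

# Flatten every variant into one lookup set so that any spelling is removed.
STOP_WORDS = frozenset(
    variant
    for variants in STOP_WORD_VARIANTS.values()
    for variant in variants
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(text):
    """Return the tokens of `text` that are not in the stop-word set."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("Tum ka ghar kahan hay"))
```

Grouping variants under a canonical form keeps the list maintainable when new spellings turn up in collected data, while the flattened set keeps the per-token lookup constant-time.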
Pages: 101 - 121
Page count: 21
Related papers
21 in total
  • [1] Automatic Detection of Offensive Language for Urdu and Roman Urdu
    Akhter, Muhammad Pervez
    Zheng Jiangbin
    Naqvi, Irfan Raza
    Abdelmajeed, Mohammed
    Sadiq, Muhammad Tariq
    IEEE ACCESS, 2020, 8 (08): : 91213 - 91226
  • [2] Multilingual Detection of Cyberbullying in Mixed Urdu, Roman Urdu, and English Social Media Conversations
    Razi, Fakhra
    Ejaz, Naveed
    IEEE ACCESS, 2024, 12 : 105201 - 105210
  • [3] Cyberbullying Detection for Urdu Language Using Machine Learning
    Mustafa, Hamza
    Zafar, Kashif
    FORTHCOMING NETWORKS AND SUSTAINABILITY IN THE AIOT ERA, VOL 1, FONES-AIOT 2024, 2024, 1035 : 244 - 257
  • [4] Cyberbullying Detection and Abuser Profile Identification on Social Media for Roman Urdu
    Atif, Ayesha
    Zafar, Amna
    Wasim, Muhammad
    Waheed, Talha
    Ali, Amjad
    Ali, Hazrat
    Shah, Zubair
    IEEE ACCESS, 2024, 12 : 123339 - 123351
  • [5] Cyberbullying detection: advanced preprocessing techniques & deep learning architecture for Roman Urdu data
    Amirita Dewani
    Mohsin Ali Memon
    Sania Bhatti
    Journal of Big Data, 8
  • [6] Cyberbullying detection: advanced preprocessing techniques & deep learning architecture for Roman Urdu data
    Dewani, Amirita
    Memon, Mohsin Ali
    Bhatti, Sania
    JOURNAL OF BIG DATA, 2021, 8 (01)
  • [7] Hate-Speech and Offensive Language Detection in Roman Urdu
    Rizwan, Hammad
    Shakeel, Muhammad Haroon
    Karim, Asim
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2512 - 2522
  • [8] Extension of Semantic Based Urdu Linguistic Resources Using Natural Language Processing
    Khalid, Komal
    Afzal, Hammad
    Moqaddas, Faiza
    Iltaf, Naima
    Sheri, Ahmed Muqeem
    Nawaz, Raheel
    2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI, 2017, : 1322 - 1325
  • [9] Detection of Cyberbullying Patterns in Low Resource Colloquial Roman Urdu Microtext using Natural Language Processing, Machine Learning, and Ensemble Techniques
    Dewani, Amirita
    Memon, Mohsin Ali
    Bhatti, Sania
    Sulaiman, Adel
    Hamdi, Mohammed
    Alshahrani, Hani
    Alghamdi, Abdullah
    Shaikh, Asadullah
    APPLIED SCIENCES-BASEL, 2023, 13 (04):
  • [10] Natural language processing with few computational linguistic resources: An experiment with automatic sentence parsing for amharic texts
    Alemu, A
    Asker, L
    7TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL V, PROCEEDINGS: COMPUTER SCIENCE AND ENGINEERING: I, 2003, : 51 - 56