Roman Urdu News Headline Classification Empowered with Machine Learning

Cited by: 9
Authors
Naqvi, Rizwan Ali [1]
Khan, Muhammad Adnan [2]
Malik, Nauman [2]
Saqib, Shazia [2]
Alyas, Tahir [2]
Hussain, Dildar [3]
Affiliations
[1] Sejong Univ, Dept Unmanned Vehicle Engn, Seoul 05006, South Korea
[2] Lahore Garrison Univ, Dept Comp Sci, Lahore 54000, Pakistan
[3] Korea Inst Adv Study, Sch Computat Sci, Seoul 02455, South Korea
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2020 / Vol. 65 / No. 02
Keywords
Roman urdu; news headline classification; long short term memory; recurrent neural network; logistic regression; multinomial naive Bayes; random forest; k neighbor; gradient boosting classifier; SENTIMENT ANALYSIS;
DOI
10.32604/cmc.2020.011686
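Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract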
Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language yet write it in different scripts. Writing Urdu in Roman characters is now among the most common forms of communication on social media in the subcontinent, which makes it a valuable source of information. English text classification is regarded as a solved problem, but only a few efforts have examined the rich information available in Roman Urdu. This is due to the many complexities involved in processing Roman Urdu data, including the lack of a tagged corpus, the lack of a set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and on social media sites such as Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier that assigns news to relevant categories, on which further analysis and modeling can be done. The classifier sorts news into five categories (health, business, technology, sports, international). First, we build the news dataset using scraping tools; then, after preprocessing, we compare the results of different machine learning algorithms such as Logistic Regression (LR), Multinomial Naive Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we use a phonetic algorithm to control lexical variation and test on news from different websites. The preliminary results suggest that more accurate classification can be achieved by monitoring noise in the data.
After applying the machine learning algorithms mentioned above, the results show that the Multinomial Naive Bayes classifier gives the best accuracy, 90.17%, which is attributed to noisy lexical variation.
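The two steps named in the abstract, a phonetic key to control spelling variation and a Multinomial Naive Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which phonetic algorithm was used, so `phonetic_key` below is a hypothetical, simplified normalizer (collapse repeated letters, drop non-initial vowels), and the six training headlines are invented toy examples rather than data from the paper's scraped corpus. The classifier is a from-scratch Multinomial Naive Bayes with Laplace smoothing, written with the standard library only.

```python
import math
import re
from collections import Counter, defaultdict


def phonetic_key(word):
    """Hypothetical normalizer (an assumption, not the paper's algorithm):
    lowercase, keep letters only, collapse repeated letters,
    then drop vowels after the first character."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    w = re.sub(r"(.)\1+", r"\1", w)  # "qeemat" -> "qemat"
    return w[0] + re.sub(r"[aeiou]", "", w[1:])  # "qemat" -> "qmt"


def tokenize(headline):
    """Split on whitespace and map every token to its phonetic key."""
    return [k for k in (phonetic_key(t) for t in headline.split()) if k]


class MultinomialNB:
    """Multinomial Naive Bayes over phonetic keys, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for tok in tokenize(doc):
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / (total + self.alpha * v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy headlines for illustration only.
train = [
    ("cricket match mein pakistan ki jeet", "sports"),
    ("hockey team ne final jeet liya", "sports"),
    ("stock market mein tezi", "business"),
    ("dollar ki qeemat mein izafa", "business"),
    ("dengue ke mareezon mein izafa", "health"),
    ("hospital mein naye doctor", "health"),
]
clf = MultinomialNB().fit([d for d, _ in train], [l for _, l in train])

# Variant spellings ("qimat", "izaafa") map to the same keys as in training.
print(clf.predict("dollar ki qimat mein izaafa"))
```

Because the classifier only sees phonetic keys, a headline spelled "qimat" shares evidence with training headlines spelled "qeemat". A real system would need a key tuned to Roman Urdu spelling habits, since a key this aggressive can also merge unrelated words.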
Pages: 1221 - 1236
Page count: 16
Related papers
50 in total
  • [21] An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu
    Rana, Toqir A.
    Shahzadi, Kiran
    Rana, Tauseef
    Arshad, Ahsan
    Tubishat, Mohammad
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
  • [22] A Semantic Representation Enhancement Method for Chinese News Headline Classification
    Yin, Zhongbo
    Tang, Jintao
    Ru, Chengsen
    Luo, Wei
    Luo, Zhunchen
    Ma, Xiaolei
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2017, 2018, 10619 : 318 - 328
  • [23] Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization
    Ashiq, Waqar
    Kanwal, Samra
    Rafique, Adnan
    Waqas, Muhammad
    Khurshaid, Tahir
    Montero, Elizabeth Caro
    Alonso, Alicia Bustamante
    Ashraf, Imran
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [24] Explainable Machine Learning Models for Swahili News Classification
    Murindanyi, Sudi
    Brian, Yiiki Afedra
    Katumba, Andrew
    Nakatumba-Nabende, Joyce
    PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2023, 2023, : 12 - 18
  • [25] ONLINE NEWS CLASSIFICATION USING MACHINE LEARNING TECHNIQUES
    Ahmed, Jeelani
    Ahmed, Muqeem
    IIUM ENGINEERING JOURNAL, 2021, 22 (02): : 210 - 225
  • [26] Application of Machine Learning Techniques for Fake News Classification
    Silva, Kim
    Paixao, Crysttian
    Rodrigues, Paulo Canas
    MEASUREMENT-INTERDISCIPLINARY RESEARCH AND PERSPECTIVES, 2024,
  • [27] Developing Machine Learning Models to Automate News Classification
    Singh, Roshan
    Chun, Soon Ae
    Atluri, Vijay
    PROCEEDINGS OF THE 21ST ANNUAL INTERNATIONAL CONFERENCE ON DIGITAL GOVERNMENT RESEARCH, DGO 2020, 2020, : 354 - 355
  • [28] Multilevel Classification of Pakistani News using Machine Learning
    Ilyas, Anum
    Obaid, Surayya
    Bawany, Narmeen Zakaria
    2021 22ND INTERNATIONAL ARAB CONFERENCE ON INFORMATION TECHNOLOGY (ACIT), 2021, : 760 - 764
  • [29] Detecting Empathy in Roman Urdu Using Transfer Learning
    Sattar, Hafsa
    Munir, Mubasher
    Malik, Muhammad Kamran
    Nasar, Zara
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2025, 24 (02)
  • [30] Roman Urdu Sentiment Analysis Using Transfer Learning
    Li, Dun
    Ahmed, Kanwal
    Zheng, Zhiyun
    Mohsan, Syed Agha Hassnain
    Alsharif, Mohammed H.
    Hadjouni, Myriam
    Jamjoom, Mona M.
    Mostafa, Samih M.
    APPLIED SCIENCES-BASEL, 2022, 12 (20):