Roman Urdu News Headline Classification Empowered with Machine Learning

Cited by: 9
Authors
Naqvi, Rizwan Ali [1]
Khan, Muhammad Adnan [2]
Malik, Nauman [2]
Saqib, Shazia [2]
Alyas, Tahir [2]
Hussain, Dildar [3]
Affiliations
[1] Sejong Univ, Dept Unmanned Vehicle Engn, Seoul 05006, South Korea
[2] Lahore Garrison Univ, Dept Comp Sci, Lahore 54000, Pakistan
[3] Korea Inst Adv Study, Sch Computat Sci, Seoul 02455, South Korea
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2020, Vol. 65, No. 02
Keywords
Roman Urdu; news headline classification; long short-term memory; recurrent neural network; logistic regression; multinomial naive Bayes; random forest; k neighbor; gradient boosting classifier; sentiment analysis
DOI
10.32604/cmc.2020.011686
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language yet write it in different scripts. Writing Urdu in Roman characters on social media is now among the most common forms of communication in the Indian subcontinent, which makes it a rich source of information. English text classification is a solved problem, but there have been only a few efforts to examine the rich information available in Roman Urdu. This is due to the numerous complexities involved in processing Roman Urdu data, including the unavailability of a tagged corpus, the lack of a set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and on social media websites such as Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier that assigns news to one of five categories (health, business, technology, sports, international), on which further analysis and modeling can be done. First, we build the news dataset using scraping tools; then, after preprocessing, we compare the results of different machine learning algorithms such as Logistic Regression (LR), Multinomial Naive Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we use a phonetic algorithm to control lexical variation and test on news from different websites. The preliminary results suggest that more accurate classification can be accomplished by monitoring noise in the data and by classifying the news. Among the algorithms applied, the Multinomial Naive Bayes classifier gives the best accuracy, 90.17%, a result attributed to the noise caused by lexical variation.
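The abstract does not say which phonetic algorithm or feature pipeline was used, so the following is only a minimal sketch of the general idea: collapse spelling variants to a shared phonetic key, then classify with a multinomial Naive Bayes model. The key rules in `phonetic_key`, the `TinyMultinomialNB` class, and the toy headlines are all assumptions made up for illustration; none of them come from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def phonetic_key(word):
    """Map a Roman Urdu spelling to a rough phonetic key (illustrative rules, not the paper's)."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    # Keep the first letter, drop later vowels and 'y', then squeeze repeated letters.
    key = w[0] + re.sub(r"[aeiouy]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def tokenize(text):
    """Split on whitespace and replace each word by its phonetic key."""
    keys = (phonetic_key(tok) for tok in text.split())
    return [k for k in keys if k]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy headlines, purely for demonstration.
train_texts = [
    "pakistan ne cricket match jeet liya",
    "hockey team ki shandar fatah",
    "dengue ke mareezon mein izafa",
    "sehat ke liye warzish zaroori",
    "dollar ki qeemat mein izafa",
    "stock market mein tezi",
]
train_labels = ["sports", "sports", "health", "health", "business", "business"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
# Variant spellings ("nay", "lia") map to the same keys as "ne" and "liya".
print(model.predict("pakistan nay cricket match jeet lia"))
```

A key this aggressive also merges some genuinely different words (for example, "liye" and "liya" share a key here), so a practical system would need rules tuned against real Roman Urdu data.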
Pages: 1221-1236
Page count: 16