Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning

Cited by: 0
Authors
Samad, Saleem Raja Abdul [1]
Ganesan, Pradeepa [1]
Al-Kaabi, Amna Salim [1]
Rajasekaran, Justin
Singaravelan, M. [2]
Basha, Peerbasha Shebbeer [3]
Affiliations
[1] Univ Technol & Appl Sci Ibri, Coll Comp & Informat Sci, IT Dept, Shinas, Oman
[2] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India
[3] Jamal Mohamed Coll, Dept Comp Sci, Tiruchirappalli, Tamil Nadu, India
Keywords
Machine learning; N-gram; linguistic features; natural language processing (NLP); malicious webpage
DOI
10.14569/IJACSA.2024.0151036
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Cyberattacks are intentional attacks on computer systems, networks, and devices. Malware, phishing, drive-by downloads, and injection are common cyberattacks that can harm individuals, businesses, and organizations. Most of these attacks trick internet users with malicious links or webpages. Malicious webpages can be used to distribute malware, steal personal information, conduct phishing attacks, or perform other malicious activities. Detecting such webpages manually is a tedious task for internet users, so locating them in cyberspace requires an automated detection tool. Machine learning (ML) techniques are currently used for this purpose. Most recent studies derive a limited number of features from webpages (both benign and malicious) and apply ML algorithms to detect fraudulent webpages. However, such constrained feature sets might not use the full potential of the dataset. This study addresses the issue by identifying malicious websites using both URL features and webpage content features. To maximize detection accuracy, both n-grams and vectorization methods from natural language processing (NLP) are adopted with a minimal feature set. To exploit the full potential of the dataset, the proposed approach derives 22 common linguistic features of the URL and generates n-grams from the domain name of the URL. The textual content of the webpages is also used. The research employs seven machine learning algorithms with three vectorization methods. The results show that the proposed method outperforms those reported in previous studies.
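The abstract mentions generating n-grams from the domain name of a URL and deriving linguistic features of the URL. The following is a minimal sketch of that general idea using only the Python standard library. The function names (`domain_of`, `char_ngrams`, `url_features`, `ngram_counts`) and the five counts in `url_features` are hypothetical illustrations, not the paper's 22 features, and the use of character-level n-grams is an assumption, since the abstract does not specify the n-gram unit.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host name of a URL, without any port."""
    # urlparse only recognizes a host when '//' precedes it,
    # so bare hosts such as "example.com" get a '//' prefix first.
    parsed = urlparse(url if "//" in url else "//" + url)
    return (parsed.hostname or "").lower()


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url: str) -> dict[str, int]:
    """A few illustrative lexical counts (hypothetical, not the paper's set)."""
    host = domain_of(url)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "digit_count": sum(ch.isdigit() for ch in url),
        "hyphen_count": url.count("-"),
        "dot_count_in_host": host.count("."),
    }


def ngram_counts(url: str, n: int = 2) -> Counter:
    """Bag-of-n-grams count vector for the domain name of a URL."""
    return Counter(char_ngrams(domain_of(url), n))
```

In a pipeline of the kind the abstract describes, count vectors like these would be passed, together with the lexical features, to a classifier; the choice of classifier and vectorization method is left open here because the abstract does not name them.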
Pages: 328-341
Page count: 14
Related papers
50 in total
  • [1] Automated Genre Classification of Books Using Machine Learning and Natural Language Processing
    Gupta, Shikha
    Agarwal, Mohit
    Jain, Satbir
    2019 9TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING (CONFLUENCE 2019), 2019, : 269 - 272
  • [2] Discover Trending Domains using Fusion of Supervised Machine Learning with Natural Language Processing
    Lakhanpal, Shilpa
    Gupta, Ajay
    Agrawal, Rajeev
    2015 18TH INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION), 2015, : 893 - 900
  • [3] Social Reminiscence in Older Adults' Everyday Conversations: Automated Detection Using Natural Language Processing and Machine Learning
    Ferrario, Andrea
    Demiray, Burcu
    Yordanova, Kristina
    Luo, Minxia
    Martin, Mike
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2020, 22 (09)
  • [4] Network Intrusion Detection using Natural Language Processing and Ensemble Machine Learning
    Das, Saikat
    Ashrafuzzaman, Mohammad
    Sheldon, Frederick T.
    Shiva, Sajjan
    2020 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2020, : 829 - 835
  • [5] Detection of Fake News Using Machine Learning and Natural Language Processing Algorithms
    Prachi, Noshin Nirvana
    Habibullah, Md.
    Rafi, Md. Emanul Haque
    Alam, Evan
    Khan, Riasat
    JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2022, 13 (06) : 652 - 661
  • [6] Automated Research Review Support Using Machine Learning, Large Language Models, and Natural Language Processing
    Pendyala, Vishnu S.
    Kamdar, Karnavee
    Mulchandani, Kapil
    ELECTRONICS, 2025, 14 (02)
  • [7] Automated Priority Assignment of Building Maintenance Tasks Using Natural Language Processing and Machine Learning
    D'Orazio, Marco
    Bernardini, Gabriele
    Di Giuseppe, Elisa
    JOURNAL OF ARCHITECTURAL ENGINEERING, 2023, 29 (03)
  • [8] Stress detection using natural language processing and machine learning over social interactions
    Nijhawan, Tanya
    Attigeri, Girija
    Ananthakrishna, T.
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [9] Detection of Phishing in Mobile Instant Messaging using Natural Language Processing and Machine Learning
    Verma, Suman
    Ayala-Rivera, Vanessa
    Portillo-Dominguez, A. Omar
    2023 11TH INTERNATIONAL CONFERENCE IN SOFTWARE ENGINEERING RESEARCH AND INNOVATION, CONISOFT 2023, 2023, : 159 - 168
  • [10] Stress detection using natural language processing and machine learning over social interactions
    Tanya Nijhawan
    Girija Attigeri
    T. Ananthakrishna
    Journal of Big Data, 9