Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning

Cited by: 0
Authors
Samad, Saleem Raja Abdul [1]
Ganesan, Pradeepa [1]
Al-Kaabi, Amna Salim [1]
Rajasekaran, Justin
Singaravelan, M. [2]
Basha, Peerbasha Shebbeer [3]
Affiliations
[1] Univ Technol & Appl Sci Ibri, Coll Comp & Informat Sci, IT Dept, Shinas, Oman
[2] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India
[3] Jamal Mohamed Coll, Dept Comp Sci, Tiruchirappalli, Tamil Nadu, India
Keywords
Machine learning; N-gram; linguistic features; natural language processing (NLP); malicious webpage
DOI
10.14569/IJACSA.2024.0151036
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Cyberattacks are intentional attacks on computer systems, networks, and devices. Malware, phishing, drive-by downloads, and injection are common cyberattacks that can harm individuals, businesses, and organizations. Most of these attacks trick internet users with malicious links or webpages. Malicious webpages can be used to distribute malware, steal personal information, conduct phishing attacks, or perform other malicious activities. Detecting such websites manually is tedious for internet users, so locating them in cyberspace requires an automated detection tool. Machine learning techniques are currently used to detect such malicious websites. Most recent studies derive a limited number of features from webpages (both benign and malicious) and use machine learning (ML) algorithms to detect fraudulent webpages. However, such constrained feature sets might not use the full potential of the dataset. This study addresses this issue by identifying malicious websites using both URL and webpage content features. To maximize detection accuracy, both n-grams and vectorization methods from natural language processing are adopted with a minimal feature set. To exploit the full potential of the dataset, the proposed approach derives 22 common linguistic features of the URL and generates n-grams from the domain name of the URL. The textual content of the webpages is also used. The research employs seven machine learning algorithms with three vectorization methods. The results show that the proposed method outperforms previous studies.
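As a rough illustration of the kind of URL processing the abstract describes, the following sketch derives character n-grams from a URL's domain name, computes a handful of lexical URL features, and turns n-grams into a count vector. It is an assumption-laden sketch, not the authors' implementation: the function names are hypothetical, the five features shown are illustrative stand-ins for the paper's 22 linguistic features (which the abstract does not list), and plain count vectorization stands in for the three vectorization methods, which the abstract does not name.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_ngrams(url: str, n: int = 3) -> list[str]:
    """Return the character n-grams of the URL's host name (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # range() is empty when the host is shorter than n, so no n-grams are produced.
    return [host[i:i + n] for i in range(len(host) - n + 1)]


def url_features(url: str) -> dict[str, int]:
    """Compute a few illustrative lexical features; NOT the paper's 22 features."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count": host.count("."),
        "hyphen_count": host.count("-"),
    }


def count_vector(ngrams: list[str], vocabulary: list[str]) -> list[int]:
    """Map a list of n-grams onto a fixed vocabulary as raw counts."""
    counts = Counter(ngrams)
    return [counts[gram] for gram in vocabulary]
```

Under these assumptions, the feature dictionary and the count vector could be concatenated into one row per URL and passed to any standard classifier; which classifiers and vectorizers the study actually used is not stated in the abstract.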
Pages: 328-341
Page count: 14