Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning

Cited by: 0
Authors
Samad, Saleem Raja Abdul [1]
Ganesan, Pradeepa [1]
Al-Kaabi, Amna Salim [1]
Rajasekaran, Justin
Singaravelan, M. [2 ]
Basha, Peerbasha Shebbeer [3 ]
Affiliations
[1] Univ Technol & Appl Sci Ibri, Coll Comp & Informat Sci, IT Dept, Shinas, Oman
[2] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India
[3] Jamal Mohamed Coll, Dept Comp Sci, Tiruchirappalli, Tamil Nadu, India
Keywords
Machine learning; N-gram; linguistic features; natural language processing (NLP); malicious webpage;
DOI
10.14569/IJACSA.2024.0151036
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
Cyberattacks are intentional attacks on computer systems, networks, and devices. Malware, phishing, drive-by downloads, and injection are common cyberattacks that can harm individuals, businesses, and organizations. Most of these attacks trick internet users with malicious links or webpages. Malicious webpages can be used to distribute malware, steal personal information, conduct phishing attacks, or perform other malicious activities. Detecting such webpages manually is a tedious task for internet users, so locating them in cyberspace requires an automated detection tool. Machine learning techniques are currently used for this purpose. Most recent studies derive a limited number of features from webpages (both benign and malicious) and use machine learning (ML) algorithms to detect fraudulent webpages. However, such restricted feature sets may not exploit the full potential of the dataset. This study addresses this issue by identifying malicious websites using both URL features and webpage content features. To maximize detection accuracy, both n-grams and vectorization methods from natural language processing are adopted with a minimal feature set. To exploit the full potential of the dataset, the proposed approach derives 22 common linguistic features of the URL and generates n-grams from the domain name of the URL. The textual content of the webpages is also used. The research employs seven machine learning algorithms with three vectorization methods. The results show that the proposed method outperforms previous studies.
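The feature-extraction idea in the abstract (linguistic features of the URL plus n-grams of its domain name, turned into vectors) can be illustrated with a minimal sketch. The abstract does not say whether the n-grams are character-level or word-level, which value of n is used, which 22 features are derived, or which three vectorization methods are applied. The code below therefore assumes character trigrams and plain count vectors, and the helper names (`domain_of`, `char_ngrams`, `url_features`, `count_vectorize`) and the five example features are hypothetical, not taken from the paper.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlparse only recognizes the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (assumed: character level)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url):
    """A few illustrative lexical features (hypothetical, not the paper's 22)."""
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "has_at_symbol": int("@" in url),
    }


def count_vectorize(domains, n=3):
    """Build a sorted n-gram vocabulary and one count vector per domain."""
    counts = [Counter(char_ngrams(d, n)) for d in domains]
    vocab = sorted(set().union(*counts))
    vectors = [[c[g] for g in vocab] for c in counts]
    return vocab, vectors


if __name__ == "__main__":
    urls = ["http://www.example.com/login", "http://pay-pal1.com"]
    domains = [domain_of(u) for u in urls]
    vocab, vectors = count_vectorize(domains, n=3)
    print(domains)
    print(len(vocab), "distinct trigrams")
    print(url_features(urls[1]))
```

In a full pipeline, the count vectors and the lexical features would be concatenated and passed to a classifier; that step is omitted here because the abstract does not name the seven algorithms.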
Pages: 328-341
Page count: 14