Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning

Cited by: 0

Authors:

Samad, Saleem Raja Abdul ^{[1]}

Ganesan, Pradeepa ^{[1]}

Al-Kaabi, Amna Salim ^{[1]}

Rajasekaran, Justin

Singaravelan, M. ^{[2]}

Basha, Peerbasha Shebbeer ^{[3]}

Affiliations:

[1] Univ Technol & Appl Sci Ibri, Coll Comp & Informat Sci, IT Dept, Shinas, Oman

[2] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India

[3] Jamal Mohamed Coll, Dept Comp Sci, Tiruchirappalli, Tamil Nadu, India

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2024 / Vol. 15 / No. 10

Keywords:

Machine learning; N-gram; linguistic features; natural language processing (NLP); malicious webpage;

DOI:

10.14569/IJACSA.2024.0151036

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Cyberattacks are intentional attacks on computer systems, networks, and devices. Malware, phishing, drive-by downloads, and injection are common cyberattacks that can harm individuals, businesses, and organizations. Most of these attacks trick internet users with malicious links or webpages. Malicious webpages can be used to distribute malware, steal personal information, conduct phishing attacks, or perform other malicious activities. Identifying such webpages manually is tedious for internet users, so locating them in cyberspace requires an automated detection tool. Machine learning techniques are currently used to detect such malicious websites. Most recent studies derive a limited number of features from webpages (both benign and malicious) and use machine learning (ML) algorithms to detect fraudulent webpages. However, such a constrained feature set might not exploit the full potential of the dataset. This study addresses this issue by identifying malicious websites using both URL and webpage content features. To maximize detection accuracy, both n-grams and vectorization methods from natural language processing are adopted with a minimal feature set. To exploit the full potential of the dataset, the proposed approach derives 22 common linguistic features of the URL and generates n-grams from the domain name of the URL. The textual content of the webpages is also used. The research employs seven machine learning algorithms with three vectorization methods. The results indicate that the proposed method outperforms the results reported in previous studies.
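
The abstract mentions generating n-grams from the domain name of a URL and applying vectorization, but does not give implementation details. The following is only a minimal sketch of that general idea, not the authors' pipeline: it assumes character-level n-grams with n = 3 and a plain count vectorization, and the helper names `domain_ngrams` and `count_vectorize` are hypothetical. The 22 linguistic features, the three vectorization methods, and the seven classifiers from the study are not reproduced here.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_ngrams(url, n=3):
    """Return character n-grams of the URL's host name (hypothetical helper).

    Assumption: n-grams are taken over characters of the host name; the
    abstract does not say whether character or word n-grams were used.
    """
    host = urlsplit(url).hostname or ""
    return [host[i:i + n] for i in range(len(host) - n + 1)]


def count_vectorize(docs):
    """Build a sorted vocabulary and one count vector per token list."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # missing tokens count as 0
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


# Illustrative URLs only; they are not taken from the paper's dataset.
urls = ["http://example.com/login", "http://examp1e.com/login"]
grams = [domain_ngrams(u) for u in urls]
vocab, vectors = count_vectorize(grams)
```

Each row of `vectors` could then be passed, together with any other URL or content features, to a classifier of choice. Note that `urlsplit` only fills in `hostname` when the URL includes a scheme such as `http://`.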

Pages: 328-341

Page count: 14
