Comparative Analysis of Machine Learning Algorithms for Email Phishing Detection Using TF-IDF, Word2Vec, and BERT

Authors:

Al Tawil, Arar [1]

Almazaydeh, Laiali [2]

Qawasmeh, Doaa [3]

Qawasmeh, Baraah [4]

Alshinwan, Mohammad [1,5]

Elleithy, Khaled [6]

Affiliations:

[1] Appl Sci Private Univ, Fac Informat Technol, Amman 11931, Jordan

[2] Abu Dhabi Univ, Coll Engn, POB 1790, Abu Dhabi, U Arab Emirates

[3] Al Balqa Appl Univ, Fac Artificial Intelligence, Salt 19117, Jordan

[4] Western Michigan Univ, Dept Civil & Construct Engn, Kalamazoo, MI 49008 USA

[5] Middle East Univ, MEU Res Unit, Amman 11831, Jordan

[6] Univ Bridgeport, Dept Comp Sci & Engn, Bridgeport, CT 06604 USA

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 81 / No. 02

Keywords:

Attacks; email phishing; machine learning; security; bidirectional encoder representations from transformers (BERT); text classifier; natural language processing (NLP)

DOI:

10.32604/cmc.2024.057279

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Cybercriminals often use fraudulent emails and fictitious email accounts to deceive individuals into disclosing confidential information, a practice known as phishing. This study evaluates the effectiveness of several machine learning algorithms in detecting phishing emails under three text representation methods: Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT). Features extracted with these methods are used to assess the performance of Logistic Regression, Decision Tree, Random Forest, and Multilayer Perceptron classifiers. With TF-IDF, the best results were obtained by the Multilayer Perceptron (Precision: 0.98, Recall: 0.98, F1-score: 0.98, Accuracy: 0.98). With Word2Vec, the Multilayer Perceptron was again the best classifier (Precision: 0.98, Recall: 0.98, F1-score: 0.98, Accuracy: 0.98). The highest performance was achieved using the BERT model, with Precision, Recall, F1-score, and Accuracy all reaching 0.99. The study highlights how advanced pre-trained models such as BERT can enhance the accuracy and reliability of fraud detection systems.
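As a rough illustration of two ideas the abstract relies on, the sketch below computes Term Frequency-Inverse Document Frequency weights for a few toy emails and derives Precision, Recall, F1-score, and Accuracy from predicted labels. It is a minimal sketch under stated assumptions, not the authors' pipeline: the helper names `tfidf` and `binary_metrics`, the whitespace tokenizer, the unsmoothed textbook weighting tf * log(N/df), and the sample emails and labels are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumption: textbook weighting, tf = count / document length and
    idf = log(N / df). Library implementations often smooth these terms.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy data, for illustration only.
emails = [
    "verify your account now",
    "meeting agenda for monday",
    "verify your password now",
]
vectors = tfidf(emails)
print(vectors[0])

# Hypothetical labels: 1 = phishing, 0 = legitimate.
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(scores)
```

In this weighting, a term that appears in fewer emails (such as "account" above) receives a higher weight than one shared across several emails (such as "verify"), which is what makes the resulting vectors usable as classifier input.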

Pages: 3395-3412

Page count: 18
