A Comparative Study on TF-IDF Feature Weighting Method and its Analysis using Unstructured Dataset

Cited by: 0
Authors
Das, Mamata [1]
Kamalanathan, Selvakumar [1]
Alphonse, P. J. A. [1]
Affiliations
[1] NIT Trichy, Trichy 620015, Tamil Nadu, India
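Keywords
TF-IDF; N-Gram; Text classification; Feature weighting; Information retrieval; SENTIMENT; REVIEWS
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract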
Text classification is the process of categorizing text into relevant categories, and its algorithms are at the core of many Natural Language Processing (NLP) applications. Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are the most widely used information retrieval methods in text classification. We investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considers two features, N-grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis. We then validated the method with state-of-the-art classifiers, namely Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). Comparing the two feature extraction approaches, we observed a significant improvement with TF-IDF features over N-gram-based features. TF-IDF achieved its maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.
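The abstract does not state which TF-IDF variant or N-gram range the authors used. As a rough illustration of the two feature types it compares, the following is a minimal Python sketch that assumes the classic weighting tf(t, d) * log(N / df(t)) over whitespace-tokenized text. The function names `ngrams` and `tfidf` and the toy reviews are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Compute per-document TF-IDF weights over n-gram terms.

    Assumed variant (not confirmed by the paper):
    tf = term count / number of terms in the document, idf = log(N / df).
    """
    term_lists = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy reviews, invented for illustration only.
reviews = [
    "great movie great cast",
    "terrible movie",
    "great sound quality",
]
unigram_weights = tfidf(reviews, n=1)  # one dict of term weights per review
bigram_weights = tfidf(reviews, n=2)   # same weighting over word bigrams
```

Under this variant a term that occurs in every document receives weight zero, while a term confined to a single review is weighted most heavily. The resulting per-document dictionaries could then be vectorized and passed to any of the classifiers listed in the abstract.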
Pages: 10