The Document Vectors Using Cosine Similarity Revisited

Citations: 0
Authors
Zhang Bingyu [1]
Arefyev, Nikolay [1, 2, 3]
Affiliations
[1] Natl Res Univ Higher Sch Econ, Moscow, Russia
[2] Samsung Res Ctr Russia, Moscow, Russia
[3] Lomonosov Moscow State Univ, Moscow, Russia
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
F [Economics];
Discipline classification code
02;
Abstract
The current state-of-the-art test accuracy (97.42%) on the IMDB movie reviews dataset was reported by Thongtan and Phienthrakul (2019). It was achieved by a logistic regression classifier trained on the Document Vectors using Cosine Similarity (DV-ngrams-cosine) proposed in their paper, together with Bag-of-N-grams (BON) vectors scaled by Naive Bayesian weights. While large pre-trained Transformer-based models have shown SOTA results across many datasets and tasks, they have not surpassed this model, even though it is much simpler and pre-trained on the IMDB dataset only. In this paper, we describe an error in the evaluation procedure of this model, which we found while trying to analyze its excellent performance on the IMDB dataset. We further show that the previously reported test accuracy of 97.42% is invalid and should be corrected to 93.68%. We also analyze the model performance with different amounts of training data (subsets of the IMDB dataset) and compare it to the Transformer-based RoBERTa model. The results show that while RoBERTa has a clear advantage for larger training sets, DV-ngrams-cosine performs better than RoBERTa when the labelled training set is very small (10 or 20 documents). Finally, we introduce a sub-sampling scheme based on Naive Bayesian weights for the training process of DV-ngrams-cosine, which leads to faster training and better quality.
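As a rough illustration of what "BON vectors scaled by Naive Bayesian weights" can look like, the sketch below computes a smoothed Naive Bayes log-count ratio per n-gram and uses it to weight binarized n-gram features. It assumes the weights follow the log-count ratio of Wang and Manning (2012); the abstract does not spell out the exact formula, and the function names, the smoothing value `alpha`, and the toy documents are hypothetical. The document vectors, the classifier, and the sub-sampling scheme are not reproduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all 1..n_max-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def nb_log_count_ratio(docs, labels, alpha=1.0, n_max=3):
    """Assumed weighting: r[f] = log((p_f / ||p||_1) / (q_f / ||q||_1)).

    p_f and q_f are smoothed counts of positive and negative documents
    that contain n-gram f (presence is binarized per document).
    """
    pos, neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        feats = set(ngrams(text.lower().split(), n_max))
        (pos if y == 1 else neg).update(feats)
    vocab = set(pos) | set(neg)
    p_norm = sum(pos[f] + alpha for f in vocab)
    q_norm = sum(neg[f] + alpha for f in vocab)
    return {f: math.log(((pos[f] + alpha) / p_norm) /
                        ((neg[f] + alpha) / q_norm))
            for f in vocab}


def nb_scaled_vector(text, ratios, n_max=3):
    """Sparse BON vector: each known n-gram present in the text gets its ratio."""
    feats = set(ngrams(text.lower().split(), n_max))
    return {f: ratios[f] for f in feats if f in ratios}


# Toy data, purely illustrative (not from the IMDB dataset).
docs = ["great movie", "great fun", "bad movie", "bad plot"]
labels = [1, 1, 0, 0]
ratios = nb_log_count_ratio(docs, labels)
vector = nb_scaled_vector("great plot", ratios)
```

In this toy example, n-grams seen only in positive documents receive positive weights, those seen only in negative documents receive negative weights, and an n-gram that is equally frequent in both classes receives a weight of zero. The resulting sparse vectors would then be passed to a linear classifier such as logistic regression.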
Pages: 129-133
Number of pages: 5
Related papers
50 in total
  • [21] Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity
    Martin-del-Campo-Rodriguez, Carolina
    Sidorov, Grigori
    Batyrshin, Ildar
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2018, PT II, 2018, 11289 : 49 - 56
  • [22] A Metaphor Detection Approach Using Cosine Similarity
    Pramanick, Malay
    Mitra, Pabitra
    PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PREMI 2017, 2017, 10597 : 358 - 364
  • [23] Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks
    Luo, Chunjie
    Zhan, Jianfeng
    Xue, Xiaohe
    Wang, Lei
    Ren, Rui
    Yang, Qiang
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2018, PT I, 2018, 11139 : 382 - 391
  • [24] A Comparative Study on Cosine Similarity Algorithm and Vector Space Model Algorithm on Document Searching
    Nengsih, Warnia
    ADVANCED SCIENCE LETTERS, 2015, 21 (10) : 3321 - 3323
  • [25] Learning similarity with cosine similarity ensemble
    Xia, Peipei
    Zhang, Li
    Li, Fanzhang
    INFORMATION SCIENCES, 2015, 307 : 39 - 52
  • [26] Analysis of Stability in Static Signatures using Cosine Similarity
    Impedovo, D.
    Pirlo, G.
    Sarcinella, L.
    Stasolla, E.
    Trullo, C. A.
    13TH INTERNATIONAL CONFERENCE ON FRONTIERS IN HANDWRITING RECOGNITION (ICFHR 2012), 2012, : 231 - 235
  • [27] Automatic Thai Subjective Examination using Cosine Similarity
    Saipech, Pongsakorn
    Seresangtakul, Pusadee
    2018 5TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATICS: CONCEPTS, THEORY AND APPLICATIONS (ICAICTA 2018), 2018, : 214 - 218
  • [28] Hindi Word Sense Disambiguation Using Cosine Similarity
    Sarika, D. K.
    Sharma, Dilip Kumar
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ICT FOR SUSTAINABLE DEVELOPMENT ICT4SD 2015, VOL 2, 2016, 409 : 801 - 808
  • [29] Matching Scientific Article Titles using Cosine Similarity and Jaccard Similarity Algorithm
    Rinjeni, Tri Puspa
    Indriawan, Ade
    Rakhmawati, Nur Aini
    Procedia Computer Science, 2024, 234 : 553 - 560
  • [30] Analysis of Dental Material Components Using Cosine Similarity
    Uematsu, Yasuaki
    Hori, Miki
    Kato, Akiko
    Hayashi, Tatsuhide
    Kawai, Tatsushi
    JOURNAL OF HARD TISSUE BIOLOGY, 2024, 33 (01) : 19 - 22