The Document Vectors Using Cosine Similarity Revisited

Cited by: 0
Authors
Zhang, Bingyu [1]
Arefyev, Nikolay [1, 2, 3]
Affiliations
[1] Natl Res Univ Higher Sch Econ, Moscow, Russia
[2] Samsung Res Ctr Russia, Moscow, Russia
[3] Lomonosov Moscow State Univ, Moscow, Russia
DOI
Not available
Chinese Library Classification
F [Economics];
Discipline classification code
02;
Abstract
The current state-of-the-art test accuracy (97.42%) on the IMDB movie reviews dataset was reported by Thongtan and Phienthrakul (2019) and achieved by the logistic regression classifier trained on the Document Vectors using Cosine Similarity (DV-ngrams-cosine) proposed in their paper and the Bag-of-N-grams (BON) vectors scaled by Naive Bayesian weights. While large pre-trained Transformer-based models have shown SOTA results across many datasets and tasks, the aforementioned model has not been surpassed by them, despite being much simpler and pre-trained on the IMDB dataset only. In this paper, we describe an error in the evaluation procedure of this model, which was found when we were trying to analyze its excellent performance on the IMDB dataset. We further show that the previously reported test accuracy of 97.42% is invalid and should be corrected to 93.68%. We also analyze the model performance with different amounts of training data (subsets of the IMDB dataset) and compare it to the Transformer-based RoBERTa model. The results show that while RoBERTa has a clear advantage for larger training sets, the DV-ngrams-cosine performs better than RoBERTa when the labelled training set is very small (10 or 20 documents). Finally, we introduce a sub-sampling scheme based on Naive Bayesian weights for the training process of the DV-ngrams-cosine, which leads to faster training and better quality.
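The abstract mentions Bag-of-N-grams vectors scaled by Naive Bayesian weights but does not give the formula. As a rough illustration only, the sketch below uses the common log-count-ratio formulation of such weights; whether the paper uses exactly this form is an assumption. The function names `nb_log_count_ratio` and `scale_by_nb`, the smoothing value `alpha`, and the toy count matrix are all hypothetical.

```python
import numpy as np

def nb_log_count_ratio(X, y, alpha=1.0):
    """Per-feature log-count ratio between positive and negative documents.

    X: (n_docs, n_features) array of non-negative n-gram counts.
    y: (n_docs,) array of binary labels (1 = positive, 0 = negative).
    alpha: additive smoothing constant (assumed value, not from the source).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = alpha + X[y == 1].sum(axis=0)  # smoothed counts over positive docs
    q = alpha + X[y == 0].sum(axis=0)  # smoothed counts over negative docs
    # Normalize each count vector to sum to 1, then take the log of the ratio.
    return np.log((p / p.sum()) / (q / q.sum()))

def scale_by_nb(X, r):
    """Scale every document's n-gram counts by the per-feature weights r."""
    return np.asarray(X, dtype=float) * r

# Toy data: 4 documents, 3 n-gram features (made up for illustration).
X = np.array([[2, 0, 1],
              [1, 0, 1],
              [0, 2, 1],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])

r = nb_log_count_ratio(X, y)
X_scaled = scale_by_nb(X, r)
```

In this toy example the first feature occurs only in positive documents and receives a positive weight, the second occurs only in negative documents and receives a negative weight, and the third is equally frequent in both classes, so its weight is zero. Under the same assumption, a scaled matrix like `X_scaled` would be the BON input to a logistic regression classifier.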
Pages: 129-133
Page count: 5