The Document Vectors Using Cosine Similarity Revisited

Cited by: 0
Authors
Zhang, Bingyu [1]
Arefyev, Nikolay [1, 2, 3]
Affiliations
[1] National Research University Higher School of Economics, Moscow, Russia
[2] Samsung Research Center Russia, Moscow, Russia
[3] Lomonosov Moscow State University, Moscow, Russia
Keywords
DOI
Not available
Chinese Library Classification number
F [Economics];
Discipline classification code
02;
Abstract
The current state-of-the-art test accuracy (97.42%) on the IMDB movie reviews dataset was reported by Thongtan and Phienthrakul (2019). It was achieved by a logistic regression classifier trained on the Document Vectors using Cosine Similarity (DV-ngrams-cosine) proposed in their paper together with Bag-of-N-grams (BON) vectors scaled by Naive Bayesian weights. While large pre-trained Transformer-based models have shown SOTA results across many datasets and tasks, they have not surpassed this model, even though it is much simpler and is pre-trained on the IMDB dataset only. In this paper, we describe an error in the evaluation procedure of this model, which we found while trying to analyze its excellent performance on the IMDB dataset. We further show that the previously reported test accuracy of 97.42% is invalid and should be corrected to 93.68%. We also analyze the model's performance with different amounts of training data (subsets of the IMDB dataset) and compare it to the Transformer-based RoBERTa model. The results show that while RoBERTa has a clear advantage for larger training sets, DV-ngrams-cosine performs better than RoBERTa when the labelled training set is very small (10 or 20 documents). Finally, we introduce a sub-sampling scheme based on Naive Bayesian weights for the training process of DV-ngrams-cosine, which leads to faster training and better quality.
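As an illustration of what "BON vectors scaled by Naive Bayesian weights" can look like, here is a minimal sketch using the common log-count-ratio formulation of Naive Bayes weights. It is an assumption that the paper's weights take this form; the abstract does not spell out the formula. The function names, the smoothing parameter `alpha`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def nb_log_count_ratios(docs, labels, alpha=1.0):
    """Return {ngram: log-count ratio} for binary labels (1 = positive, 0 = negative).

    Assumed formulation (not taken from the abstract): smoothed, per-class
    normalised document counts, compared through a log ratio.
    """
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Binarised counts: each n-gram contributes once per document.
        (pos if label == 1 else neg).update(set(tokens))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos[t] + alpha for t in vocab)
    neg_total = sum(neg[t] + alpha for t in vocab)
    return {
        t: math.log(((pos[t] + alpha) / pos_total) / ((neg[t] + alpha) / neg_total))
        for t in vocab
    }


def scale_bon(tokens, ratios):
    """Sparse binary bag-of-n-grams vector for one document, scaled by the weights."""
    return {t: ratios[t] for t in set(tokens) if t in ratios}


# Hypothetical toy corpus: unigrams only, two positive and two negative reviews.
docs = [["great", "movie"], ["great", "fun"], ["bad", "movie"], ["bad", "boring"]]
labels = [1, 1, 0, 0]
ratios = nb_log_count_ratios(docs, labels)
features = scale_bon(["great", "bad", "movie"], ratios)
```

In this sketch, n-grams seen mostly in positive documents get positive weights, n-grams seen mostly in negative documents get negative weights, and n-grams that are equally frequent in both classes get a weight of zero. The scaled vectors would then be passed to a classifier such as logistic regression.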
Pages: 129-133
Number of pages: 5
Related papers
  • [1] Sentiment Classification using Document Embeddings trained with Cosine Similarity
    Thongtan, Tan
    Phienthrakul, Tanasanee
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019): STUDENT RESEARCH WORKSHOP, 2019: 407-414
  • [2] Document Clustering using Concept Space and Cosine Similarity Measurement
    Muflikhah, Lailil
    Baharudin, Baharum
    PROCEEDINGS OF THE 2009 INTERNATIONAL CONFERENCE ON COMPUTER TECHNOLOGY AND DEVELOPMENT, VOL 1, 2009: 58-62
  • [3] Document Understanding Using Improved Sqrt-Cosine Similarity
    Sohangir, Sahar
    Wang, Dingding
    2017 11TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2017: 278-279
  • [4] An Improved Cosine Similarity Algorithm Based on Document Similarity
    Lee, Ming
    Zhao, Heji
    INTERNATIONAL SYMPOSIUM ON FUZZY SYSTEMS, KNOWLEDGE DISCOVERY AND NATURAL COMPUTATION (FSKDNC 2014), 2014: 196-204
  • [5] Comparison of Semantic Vectors with Reduced Precision using the Cosine Similarity Measure
    Karwatowski, Michal
    Wielgosz, Maciej
    Pietron, Marcin
    Staruchowicz, Mateusz
    Wiatr, Kazimierz
    PROCEEDINGS OF THE 2017 INTELLIGENT SYSTEMS CONFERENCE (INTELLISYS), 2017: 898-904
  • [6] Document Similarity Detection using K-Means and Cosine Distance
    Usino, Wendi
    Prabuwono, Anton Satria
    Allehaibi, Khalid Hamed S.
    Bramantoro, Arif
    Hasniaty, A.
    Amaldi, Wahyu
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2019, 10 (02): 165-170
  • [7] Document similarity detection using K-Means and cosine distance
    Usino W.
    Prabuwono A.S.
    Allehaibi K.H.S.
    Bramantoro A.
    Hasniaty A.
    Amaldi W.
    Intl. J. Adv. Comput. Sci. Appl., (2): 165-170
  • [8] Comparing the Effectiveness of Query-Document Clusterings Using the QDSM and Cosine Similarity
    Gutierrez-Soto, Claudio
    Curiel Diaz, Arturo
    Hubert, Gilles
    2019 38TH INTERNATIONAL CONFERENCE OF THE CHILEAN COMPUTER SCIENCE SOCIETY (SCCC), 2019
  • [9] Optimal Multi-document Integration Using Iterative Elimination and Cosine Similarity
    George, Fr Augustine
    Hanumanthappa, M.
    EMERGING TRENDS IN EXPERT APPLICATIONS AND SECURITY, 2019, 841: 699-705
  • [10] Hierarchical Document Clustering based on Cosine Similarity measure
    Popat, Shraddha K.
    Deshmukh, Pramod B.
    Metre, Vishakha A.
    2017 1ST INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND INFORMATION MANAGEMENT (ICISIM), 2017: 153-159