The Document Vectors Using Cosine Similarity Revisited

Cited by: 0
Authors
Zhang Bingyu [1]
Arefyev, Nikolay [1, 2, 3]
Affiliations
[1] Natl Res Univ Higher Sch Econ, Moscow, Russia
[2] Samsung Res Ctr Russia, Moscow, Russia
[3] Lomonosov Moscow State Univ, Moscow, Russia
Keywords
DOI
Not available
Chinese Library Classification
F [Economics];
Subject classification code
02;
Abstract
The current state-of-the-art test accuracy (97.42%) on the IMDB movie reviews dataset was reported by Thongtan and Phienthrakul (2019) and achieved by the logistic regression classifier trained on the Document Vectors using Cosine Similarity (DV-ngrams-cosine) proposed in their paper and the Bag-of-N-grams (BON) vectors scaled by Naive Bayesian weights. While large pre-trained Transformer-based models have shown SOTA results across many datasets and tasks, the aforementioned model has not been surpassed by them, despite being much simpler and pre-trained on the IMDB dataset only. In this paper, we describe an error in the evaluation procedure of this model, which was found when we were trying to analyze its excellent performance on the IMDB dataset. We further show that the previously reported test accuracy of 97.42% is invalid and should be corrected to 93.68%. We also analyze the model performance with different amounts of training data (subsets of the IMDB dataset) and compare it to the Transformer-based RoBERTa model. The results show that while RoBERTa has a clear advantage for larger training sets, the DV-ngrams-cosine performs better than RoBERTa when the labelled training set is very small (10 or 20 documents). Finally, we introduce a sub-sampling scheme based on Naive Bayesian weights for the training process of the DV-ngrams-cosine, which leads to faster training and better quality.
Pages: 129-133
Page count: 5