A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus

Authors
Gadi, Manoel Fernando Alonso [1]
Sicilia, Miguel Angel [1]
Affiliations
[1] Univ Alcala De Henares, Plaza San Diego S-N, Alcala De Henares 28801, Madrid, Spain
Keywords
News; NLP; Cryptocurrency; FinBERT; Events; Labeled dataset; Sentiment corpus
DOI
10.1007/s10579-024-09743-x
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
This paper describes Cryptocurrency Linguo (CryptoLin), a novel corpus of 2683 cryptocurrency-related news articles covering a period of more than three years. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news. Eighty-three people participated in the annotation process; each news title was randomly assigned to and blindly annotated by three human annotators, one from each of three cohorts, and the final label was set by a consensus mechanism using simple voting. The annotators were intentionally drawn from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one annotator was in total disagreement with the other two (e.g., one negative vs. two positive, or one positive vs. two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients indicate that CryptoLin's inter-annotator agreement is of acceptable quality. The dataset also includes a text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, it incorporates four pretrained sentiment analysis models: VADER, TextBlob, Flair, and FinBERT. VADER and FinBERT demonstrate reasonable performance on the CryptoLin dataset, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) presents the best performance, indicating an advantage of being trained on financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis, provided for reproducibility, are available at the project's GitHub.
Overall, CryptoLin aims to complement current knowledge by providing a novel and publicly available cryptocurrency sentiment corpus (Gadi and Ángel Sicilia, CryptoLin dataset and Python Jupyter notebooks reproducibility codes, 2022) and by fostering research on cryptocurrency sentiment analysis and its potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research on annotator selection, assignment, and biases.
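The voting rule and the agreement check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the label encoding (-1, 0, 1), the function names `consensus_label` and `fleiss_kappa`, and the handling of a three-way split (negative, neutral, positive), which the abstract does not specify and which is defaulted to neutral here, are all assumptions made for illustration. The kappa function follows the standard Fleiss formula for a fixed number of raters per item.

```python
from collections import Counter

# Assumed encoding of the three discrete labels (not stated in the abstract).
NEGATIVE, NEUTRAL, POSITIVE = -1, 0, 1
LABELS = (NEGATIVE, NEUTRAL, POSITIVE)


def consensus_label(votes):
    """Combine three annotations into one label by simple voting.

    If opposite polarities both appear (e.g. one negative vs. two positive),
    the item is treated as a minority report and defaults to neutral.
    Assumption: a three-way split is also resolved to neutral.
    """
    if len(votes) != 3 or any(v not in LABELS for v in votes):
        raise ValueError("expected exactly three labels from (-1, 0, 1)")
    if NEGATIVE in votes and POSITIVE in votes:
        return NEUTRAL
    # Remaining cases use at most two adjacent labels, so a majority exists.
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(items):
    """Fleiss's kappa for items that each carry the same number of ratings."""
    n_items = len(items)
    n_raters = len(items[0])
    totals = Counter()          # ratings per category over all items
    agreement_sum = 0.0         # sum of per-item agreement P_i
    for votes in items:
        counts = Counter(votes)
        totals.update(counts)
        squares = sum(c * c for c in counts.values())
        agreement_sum += (squares - n_raters) / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, `consensus_label([POSITIVE, POSITIVE, NEUTRAL])` yields the majority label, while `consensus_label([NEGATIVE, POSITIVE, POSITIVE])` falls back to neutral under the minority-report rule.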
Pages: 19