A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus

Cited by: 0
Authors
Gadi, Manoel Fernando Alonso [1]
Sicilia, Miguel Angel [1]
Affiliations
[1] Univ Alcala De Henares, Plaza San Diego S-N, Alcala De Henares 28801, Madrid, Spain
Keywords
News; NLP; Cryptocurrency; FinBERT; Events; Labeled dataset; Sentiment corpus;
DOI
10.1007/s10579-024-09743-x
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles covering a period of more than three years. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news, respectively. Eighty-three people participated in the annotation process; each news title was randomly assigned to and blindly annotated by three human annotators, one from each of three cohorts, followed by a consensus mechanism using simple voting. The annotators were intentionally selected from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one of the annotators was in total disagreement with the other two (e.g., one negative vs two positive or one positive vs two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients demonstrate CryptoLin's acceptable quality of inter-annotator agreement. The dataset also includes a text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, the study incorporates four pretrained sentiment analysis models: VADER, TextBlob, Flair, and FinBERT. VADER and FinBERT demonstrate reasonable performance on the CryptoLin dataset, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) presents the best performance, indicating an advantage of being trained with financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis, for reproducibility, are available at the project's GitHub.
Overall, CryptoLin aims to complement current knowledge by providing a novel and publicly available cryptocurrency sentiment corpus (Gadi and Ángel Sicilia, CryptoLin dataset and Python Jupyter notebooks reproducibility codes, 2022) and by fostering research on cryptocurrency sentiment analysis and its potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research in annotator selection, assignment, and biases.
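The consensus rule described in the abstract (simple voting over three annotations, with a "minority report" defaulting to neutral) can be illustrated with a minimal sketch. This is not the authors' code: the numeric encoding (-1, 0, 1), the function name `consensus_label`, and the handling of three mutually different votes (treated here as neutral) are assumptions made only for illustration.

```python
from collections import Counter

# Assumed label encoding; the abstract only says "discrete values".
NEGATIVE, NEUTRAL, POSITIVE = -1, 0, 1


def consensus_label(votes):
    """Combine exactly three annotator votes into one label.

    Hypothetical helper sketching the rule stated in the abstract:
    - simple majority voting over three votes;
    - if one annotator is in total disagreement with the other two
      (negative vs positive), default to neutral.
    Assumption: three mutually different votes also yield neutral,
    since the abstract does not cover that case.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    if any(v not in (NEGATIVE, NEUTRAL, POSITIVE) for v in votes):
        raise ValueError("unknown label in votes")

    # Opposite polarities among the three votes: minority report
    # (2 vs 1) or, by assumption, a three-way split.
    if NEGATIVE in votes and POSITIVE in votes:
        return NEUTRAL

    # Otherwise only two adjacent labels occur, so a strict
    # majority always exists among three votes.
    label, _count = Counter(votes).most_common(1)[0]
    return label
```

For example, under these assumptions `consensus_label([POSITIVE, POSITIVE, NEUTRAL])` keeps the majority label, while `consensus_label([POSITIVE, POSITIVE, NEGATIVE])` falls back to neutral.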
Pages: 19
Related papers
50 in total
  • [31] Constructing a Chinese Conversation Corpus for Sentiment Analysis
    Zhou, Yujun
    Li, Changliang
    Xu, Bo
    Xu, Jiaming
    Yang, Lei
    Xu, Bo
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2017, 2018, 10619 : 578 - 589
  • [32] Sentiment Analysis and Opinion Mining: The EmotiBlog Corpus
    Fernandez, Javi
    Boldrini, Ester
    Manuel Gomez, Jose
    Martinez-Barco, Patricio
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2011, (47): 179 - 187
  • [33] Building a Sentiment Corpus using a Gamified Framework
    Tiam-Lee, Thomas James
    See, Solomon
    2014 INTERNATIONAL CONFERENCE ON HUMANOID, NANOTECHNOLOGY, INFORMATION TECHNOLOGY, COMMUNICATION AND CONTROL, ENVIRONMENT AND MANAGEMENT (HNICEM), 2014
  • [34] A Review on Corpus Annotation for Arabic Sentiment Analysis
    Almuqren, Latifah
    Alzammam, Arwa
    Alotaibi, Shahad
    Cristea, Alexandra
    Alhumoud, Sarah
    SOCIAL COMPUTING AND SOCIAL MEDIA: APPLICATIONS AND ANALYTICS, SCSM 2017, PT II, 2017, 10283 : 215 - 225
  • [35] SEDAR: a Large Scale French-English Financial Domain Parallel Corpus
    Ghaddar, Abbas
    Langlais, Philippe
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020: 3595 - 3602
  • [36] AspectEmo: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Sentiment Analysis
    Kocon, Jan
    Radom, Jarema
    Kaczmarz-Wawryk, Ewa
    Wabnic, Kamil
    Zajaczkowska, Ada
    Zasko-Zielinska, Monika
    21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS ICDMW 2021, 2021: 166 - 173
  • [37] Impact of corpus domain for sentiment classification: An evaluation study using supervised machine learning techniques
    Karsi, Redouane
    Zaim, Mounia
    El Alami, Jamila
    2ND INTERNATIONAL CONFERENCE ON MEASUREMENT INSTRUMENTATION AND ELECTRONICS, 2017, 870
  • [38] AFPCorp: a corpus of advertisements for financial products
    Adams, Heather
    LFE-REVISTA DE LENGUAS PARA FINES ESPECIFICOS, 2011, 17 : 377 - 412
  • [39] Sentiment and emotion in financial journalism: a corpus-based, cross-linguistic analysis of the effects of COVID
    Chelo Vargas-Sierra
    M. Ángeles Orts
    Humanities and Social Sciences Communications, 10
  • [40] Sentiment and emotion in financial journalism: a corpus-based, cross-linguistic analysis of the effects of COVID
    Vargas-Sierra, Chelo
    Orts, M. Angeles
    HUMANITIES & SOCIAL SCIENCES COMMUNICATIONS, 2023, 10 (01)