A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus

Cited by: 0

Authors:

Gadi, Manoel Fernando Alonso [1]

Sicilia, Miguel Angel [1]

Affiliations:

[1] Univ Alcala De Henares, Plaza San Diego S-N, Alcala De Henares 28801, Madrid, Spain

Source:

LANGUAGE RESOURCES AND EVALUATION | 2024

Keywords:

News; NLP; Cryptocurrency; FinBERT; Events; Labeled dataset; Sentiment corpus;

DOI:

10.1007/s10579-024-09743-x

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles spanning more than three years. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news. Eighty-three people participated in the annotation process; each news title was randomly assigned to and blindly annotated by three human annotators, one from each of three cohorts, followed by a consensus mechanism using simple voting. The annotators were intentionally drawn from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one annotator was in total disagreement with the other two (e.g., one negative vs. two positive, or one positive vs. two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients demonstrate an acceptable quality of inter-annotator agreement for CryptoLin. The dataset also includes a text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, it incorporates four pretrained sentiment analysis models: VADER, TextBlob, Flair, and FinBERT. VADER and FinBERT demonstrate reasonable performance on the CryptoLin dataset, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) presents the best performance, indicating an advantage of being trained on financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis, provided for reproducibility, are available at the project's GitHub.
Overall, CryptoLin aims to complement current knowledge by providing a novel and publicly available cryptocurrency sentiment corpus (Gadi and Ángel Sicilia, CryptoLin dataset and Python Jupyter notebooks reproducibility codes, 2022) and by fostering research on cryptocurrency sentiment analysis and its potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research on annotator selection, assignment, and biases.
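
The consensus rule described in the abstract (simple voting among three annotators, with a minority report defaulting to neutral) might be sketched as follows. This is a minimal illustration and not the authors' code: the function name `consensus_label`, the numeric encoding of the labels, and the handling of a three-way split (one negative, one neutral, one positive), which the abstract does not specify, are all assumptions.

```python
from collections import Counter

# Assumed numeric encoding; the abstract only says "discrete values".
NEGATIVE, NEUTRAL, POSITIVE = -1, 0, 1


def consensus_label(votes):
    """Combine exactly three annotator votes into a single label.

    Hypothetical helper: follows the rule stated in the abstract, with an
    assumed tie-break for three-way splits.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    if any(v not in (NEGATIVE, NEUTRAL, POSITIVE) for v in votes):
        raise ValueError("unknown label in votes")

    # Minority report: positive and negative votes both occur, so one
    # annotator is in total disagreement with another. The abstract
    # defaults the 2-vs-1 cases to neutral; treating the three-way split
    # the same way is an assumption made here.
    if POSITIVE in votes and NEGATIVE in votes:
        return NEUTRAL

    # Otherwise at most two distinct labels remain, so three votes
    # always yield a strict majority.
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Example usage
print(consensus_label([POSITIVE, POSITIVE, NEUTRAL]))   # simple majority
print(consensus_label([POSITIVE, POSITIVE, NEGATIVE]))  # minority report
```

Because the dataset is said to include the three manual annotations for each item, a rule of this kind could in principle be re-run against the released labels to audit the published consensus column.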


Pages: 19
