Mining Insights from Large-Scale Corpora Using Fine-Tuned Language Models

Cited by: 11
Authors:
Palakodety, Shriphani [1 ]
KhudaBukhsh, Ashiqur R. [2 ]
Carbonell, Jaime G. [2 ]
Affiliations:
[1] Onai, San Jose, CA 95129 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords: ELECTION; TWITTER
DOI: 10.3233/FAIA200306
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Mining insights from a large volume of social media text with minimal supervision is a highly challenging Natural Language Processing (NLP) task. While the efficacy of Language Models (LMs) on several downstream tasks is well studied, their applicability to answering relational questions, tracking perception, or mining deeper insights is under-explored. A few recent lines of work have scratched the surface by studying pre-trained LMs' (e.g., BERT's) ability to answer relational questions through "fill-in-the-blank" cloze statements (e.g., [Dante was born in MASK]). BERT predicts the MASK-ed word with a list of words ranked by probability (in this case, BERT correctly predicts Florence with the highest probability). In this paper, we conduct a feasibility study of fine-tuned LMs with a different focus: tracking polls, tracking community perception, and mining deeper insights of the kind typically obtained through costly surveys. Our main focus is a substantial corpus of comments extracted from YouTube videos (6,182,868 comments on 130,067 videos by 1,518,077 users) posted within the 100 days prior to the 2019 Indian General Election. Using fill-in-the-blank cloze statements with a recent high-performance language model, BERT, we present a novel application of this family of tools that can (1) aggregate political sentiment, (2) reveal community perception, and (3) track evolving national priorities and issues of interest.
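The "fill-in-the-blank" cloze probing described in the abstract can be sketched as follows. This is a minimal illustration, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (the paper does not specify its tooling); running it requires `pip install transformers` and downloads the model on first use.

```python
# Minimal sketch of cloze-statement probing with a masked language model.
# BERT scores every vocabulary word for the [MASK] slot; the fill-mask
# pipeline returns candidates ranked by probability, mirroring the
# "Dante was born in [MASK]" example from the abstract.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask("Dante was born in [MASK].")
for p in predictions[:3]:
    # Each prediction carries the candidate token and its probability.
    print(f"{p['token_str']:>12}  {p['score']:.3f}")
```

The same mechanism underlies the paper's poll-tracking use: a domain-specific cloze statement is posed to a fine-tuned model, and the ranked candidate list is read as an aggregate signal over the training corpus.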
Pages: 1890-1897
Page count: 8