Mining Insights from Large-Scale Corpora Using Fine-Tuned Language Models

Cited by: 11
Authors:
Palakodety, Shriphani [1 ]
KhudaBukhsh, Ashiqur R. [2 ]
Carbonell, Jaime G. [2 ]
Affiliations:
[1] Onai, San Jose, CA 95129 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords: ELECTION; TWITTER
DOI: 10.3233/FAIA200306
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Mining insights from a large volume of social media text with minimal supervision is a highly challenging Natural Language Processing (NLP) task. While the efficacy of Language Models (LMs) on several downstream tasks is well studied, their applicability to answering relational questions, tracking perception, or mining deeper insights is under-explored. A few recent lines of work have scratched the surface by studying pre-trained LMs' (e.g., BERT's) ability to answer relational questions through "fill-in-the-blank" cloze statements (e.g., [Dante was born in MASK]). BERT predicts the MASK-ed word with a list of words ranked by probability (in this case, BERT correctly predicts Florence with the highest probability). In this paper, we conduct a feasibility study of fine-tuned LMs with a different focus: tracking polls, tracking community perception, and mining deeper insights of the kind typically obtained through costly surveys. Our main focus is a substantial corpus of comments extracted from YouTube videos (6,182,868 comments on 130,067 videos by 1,518,077 users) posted within the 100 days prior to the 2019 Indian General Election. Using fill-in-the-blank cloze statements with a recent high-performance language model, BERT, we present a novel application of this family of tools that can (1) aggregate political sentiment, (2) reveal community perception, and (3) track evolving national priorities and issues of interest.
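The "fill-in-the-blank" cloze probing described in the abstract can be sketched as follows. This is a minimal illustration, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (the paper does not specify its tooling); running it requires `pip install transformers` and downloads the model on first use.

```python
# Minimal sketch of cloze-statement probing with a masked language model.
# BERT scores every vocabulary word for the [MASK] slot; the fill-mask
# pipeline returns candidates ranked by probability, mirroring the
# "Dante was born in [MASK]" example from the abstract.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask("Dante was born in [MASK].")
for p in predictions[:3]:
    # Each prediction carries the candidate token and its probability.
    print(f"{p['token_str']:>12}  {p['score']:.3f}")
```

The same mechanism underlies the paper's poll-tracking use: a domain-specific cloze statement is posed to a fine-tuned model, and the ranked candidate list is read as an aggregate signal over the training corpus.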
Pages: 1890-1897
Page count: 8