Gender identification on Twitter

Cited by: 14
Authors
Ikae, Catherine [1]
Savoy, Jacques [1]
Affiliations
[1] Univ Neuchatel, Comp Sci Dept, Neuchatel, Switzerland
Keywords
STYLE;
DOI
10.1002/asi.24541
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naive Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when considering similar corpora under the same conditions. To this end, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection that reduces the feature set to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (an increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
Pages: 58-69
Number of pages: 12