Gender identification on Twitter

Cited by: 14
Authors
Ikae, Catherine [1]
Savoy, Jacques [1]
Affiliation
[1] Univ Neuchatel, Comp Sci Dept, Neuchatel, Switzerland
Keywords
STYLE;
DOI
10.1002/asi.24541
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams, etc.), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naive Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when applied to similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection that reduces the feature set to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
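The abstract does not say what the two stages of the feature selection are. The sketch below is therefore only one plausible reading, written for illustration and not taken from the paper: stage 1 is assumed to be a document-frequency filter, and stage 2 is assumed to rank the surviving terms by a smoothed log-ratio of their per-class document frequencies. The function name `two_stage_select` and the parameters `min_df` and `top_k` are hypothetical.

```python
from collections import Counter
import math


def two_stage_select(docs, labels, min_df=2, top_k=300):
    """Return up to top_k terms from tokenized docs with two class labels.

    Both stages are assumptions made for this sketch, not the
    procedure used in the paper.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two categories are expected")
    a, b = classes

    # Stage 1 (assumed): keep terms that occur in at least min_df documents.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    candidates = {term for term, n in df.items() if n >= min_df}

    # Per-class document frequencies of the surviving terms.
    df_a, df_b = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        target = df_a if label == a else df_b
        target.update(set(tokens) & candidates)
    n_a, n_b = labels.count(a), labels.count(b)

    # Stage 2 (assumed): score each term by the absolute log-ratio of its
    # add-one smoothed document frequency in the two classes.
    def score(term):
        p_a = (df_a[term] + 1) / (n_a + 2)
        p_b = (df_b[term] + 1) / (n_b + 2)
        return abs(math.log(p_a / p_b))

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda term: (-score(term), term))
    return ranked[:top_k]
```

With `top_k=300`, the output would be of the same order as the reduced feature set mentioned in the abstract. The selected terms could then be passed to any of the classifiers listed there.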
Pages: 58-69
Number of pages: 12