Gender identification on Twitter

Cited by: 14

Authors:

Ikae, Catherine [1]

Savoy, Jacques [1]

Affiliations:

[1] Univ Neuchatel, Comp Sci Dept, Neuchatel, Switzerland

Source:

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY | 2022, Vol. 73, No. 1

Keywords:

STYLE;

DOI:

10.1002/asi.24541

Chinese Library Classification (CLC) code:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naive Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection to reduce the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.

Pages: 58-69

Number of pages: 12
