Gender identification on Twitter

Cited by: 14

Authors:

Ikae, Catherine [1]

Savoy, Jacques [1]

Affiliation:

[1] Univ Neuchatel, Comp Sci Dept, Neuchatel, Switzerland

Source:

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY | 2022 / Vol. 73 / Issue 01

Keywords:

STYLE;

DOI:

10.1002/asi.24541

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naive Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when applied to similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection that reduces the feature set to a few hundred terms without any significant change in performance compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.

Pages: 58-69

Number of pages: 12
