On the role of syntactic dependencies and discourse relations for author and gender identification

Cited by: 10
Authors
Soler-Company, Juan [1]
Wanner, Leo [1, 2]
Affiliations
[1] Pompeu Fabra Univ, Carrer de Roc Boronat 138, Barcelona 08018, Spain
[2] ICREA, Carrer de Roc Boronat 138, Barcelona 08018, Spain
Keywords
Author profiling; Author identification; Gender identification; Text classification; ATTRIBUTION
DOI
10.1016/j.patrec.2017.12.006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Author and author gender identification are two major tasks in the profiling of authors of written material. Author identification (or, more precisely, "authorship attribution") assigns a piece of written material to an author chosen from a given list of author names. Gender identification predicts the gender of the author (male vs. female). Both tasks are relevant to a number of applications, including plagiarism and deception detection, document authenticity verification, and blackmailing. The state of the art in both fields tends to rely mainly on lexical and token (sequence) distribution features. This neglects numerous linguistic studies that clearly indicate the high relevance of "deep linguistic" features, i.e., syntactic and discourse features, to the characterization of the style of an author or a group of authors. Our work on author and gender identification confirms this relevance. We show with two different genres, namely blog posts and literary writings, that the use of deep linguistic features is very effective: it leads to accuracies of >78% (blog posts) and >91% (literary writings) in author identification, and of >89% (blog posts) and >90% (literary writings) in gender identification. (c) 2017 Elsevier B.V. All rights reserved.
Pages: 87-95
Number of pages: 9