Authors:
Julien Hay (1, 2, 3); Bich-Liên Doan (1, 2); Fabrice Popineau (1, 2) and Ouassim Ait Elhara (3)
Affiliations:
1. CentraleSupélec, Paris-Saclay University, 91190 Gif-sur-Yvette, France
2. Laboratoire de Recherche en Informatique, Paris-Saclay University, 91190 Gif-sur-Yvette, France
3. Octopeek SAS, 95880 Enghien-les-Bains, France
Keyword(s):
Writing Style, Authorship Analysis, Representation Learning, Deep Learning, Filtering, Preprocessing.
Abstract:
Authorship analysis studies writing styles in order to predict the authorship of a piece of written text. Our main task is to represent documents so that they reflect authorship. To this end, we use these representations for authorship attribution, in which the author of a document is identified from a list of known authors. We have recently shown that style can be generalized to a set of reference authors. We trained a deep neural network (DNN) to identify the authors of a large reference corpus and thereby learned how to represent style in a general stylometric space. With such a representation learning method, we can embed new documents into this stylometric space, which highlights their stylistic features. In this paper, we want to validate the following hypothesis: the more authorship terms are filtered, the better the models generalize. Attention can thus be focused on the style-related linguistic structures that make up authors' styles. To this end, we propose a new, efficient, and highly scalable filtering process. This process yields higher accuracy on various test sets for both authorship attribution and clustering tasks.