Article

Authorship Attribution in Russian in Real-World Forensics Scenario

Authors:

Polina Panicheva,

Tatiana Litvinova

Statistical Language and Speech Processing: 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings

Pages 299–310

https://doi.org/10.1007/978-3-030-31372-2_25

Published: 14 October 2019

Abstract

Recent demands in authorship attribution, specifically, cross-topic authorship attribution with small numbers of training samples and very short texts, impose new challenges on corpora design, feature and algorithm development. In the current work we address these challenges by performing authorship attribution on a specifically designed dataset in Russian. We present a dataset of short written texts in Russian, where both authorship and topic are controlled. We propose a pairwise classification design closely resembling a real-world forensic task. Semantic coherence features are introduced to supplement well-established n-gram features in challenging cross-topic settings. Distance-based measures are compared with machine learning algorithms. The experiment results support the intuition that for very small datasets, distance-based measures perform better than machine learning techniques. Moreover, pairwise classification results show that in difficult cross-topic cases, content-independent features, i.e., part-of-speech n-grams and semantic coherence, are promising. The results are supported by feature significance analysis for the proposed dataset.
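The abstract gives no implementation details, so the following is only a minimal sketch of how a distance-based pairwise decision over character n-gram profiles might look. The function names, the choice of cosine distance, and the trigram size are all illustrative assumptions, not the authors' method or feature set.

```python
from collections import Counter
import math


def char_ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams (assumed feature)."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def attribute_pairwise(questioned, known_a, known_b, n=3):
    """Hypothetical pairwise setup: pick the candidate whose known text
    lies closer to the questioned text. Returns (label, dist_a, dist_b)."""
    q = char_ngram_profile(questioned, n)
    d_a = cosine_distance(q, char_ngram_profile(known_a, n))
    d_b = cosine_distance(q, char_ngram_profile(known_b, n))
    return ("A" if d_a <= d_b else "B"), d_a, d_b
```

A distance-based rule like this needs no training step, which is one reason such measures are often considered when only a few short texts per author are available. Content-independent features such as part-of-speech n-grams would require a tagger and are not shown here.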

Index Terms

Authorship Attribution in Russian in Real-World Forensics Scenario
1. Applied computing
  1. Arts and humanities
    1. Language translation
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification

Index terms have been assigned to the content through auto-classification.

Published In

Statistical Language and Speech Processing: 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings

Oct 2019

325 pages

ISBN:978-3-030-31371-5

DOI:10.1007/978-3-030-31372-2

Editors:
Carlos Martín-Vide (Rovira i Virgili University, Tarragona, Spain),
Matthew Purver (Queen Mary University of London, London, UK),
Senja Pollak (Jožef Stefan Institute, Ljubljana, Slovenia)

© Springer Nature Switzerland AG 2019.

Publisher

Springer-Verlag

Berlin, Heidelberg
