DOI: 10.1007/978-3-030-31372-2_25
Article

Authorship Attribution in Russian in Real-World Forensics Scenario

Published: 14 October 2019

Abstract

Recent demands in authorship attribution, in particular cross-topic attribution with few training samples and very short texts, pose new challenges for corpus design and for feature and algorithm development. In this work we address these challenges by performing authorship attribution on a purpose-built dataset of short written texts in Russian, in which both authorship and topic are controlled. We propose a pairwise classification design that closely resembles a real-world forensic task. Semantic coherence features are introduced to supplement well-established n-gram features in challenging cross-topic settings, and distance-based measures are compared with machine learning algorithms. The experimental results support the intuition that, for very small datasets, distance-based measures perform better than machine learning techniques. Moreover, the pairwise classification results show that in difficult cross-topic cases, content-independent features, namely part-of-speech n-grams and semantic coherence, are promising. These results are supported by a feature significance analysis on the proposed dataset.
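
To make the idea of a distance-based measure over n-gram features concrete, here is a minimal sketch. The abstract does not say which n-gram order, distance, or decision rule the authors use, so every choice below is an assumption for illustration only: character trigrams, cosine distance, and a two-candidate comparison as one possible reading of a "pairwise" design. The function names `char_ngrams`, `cosine_distance`, and `closer_candidate` are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text.

    Character trigrams are an assumed feature choice, not the paper's.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text shorter than n has no n-grams; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def closer_candidate(questioned, known_a, known_b, n=3):
    """Decide which of two known texts is nearer to the questioned text.

    Returns the label ("A" or "B") together with both distances.
    """
    q = char_ngrams(questioned, n)
    d_a = cosine_distance(q, char_ngrams(known_a, n))
    d_b = cosine_distance(q, char_ngrams(known_b, n))
    return ("A" if d_a <= d_b else "B"), d_a, d_b
```

A call such as `closer_candidate(questioned_text, sample_from_author_a, sample_from_author_b)` needs no training step, which is one reason distance-based approaches of this kind are attractive when only a few short samples per author are available.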


Published In

Statistical Language and Speech Processing: 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings
Oct 2019, 325 pages
ISBN: 978-3-030-31371-5
DOI: 10.1007/978-3-030-31372-2

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. Authorship identification
2. Plagiarism and spam filtering
3. Forensic authorship identification
4. Distributional semantics
5. Russian language
