Abstract
The need to handle the huge number of anonymous documents on the Internet is motivating the study of new methods for authorship recognition. The principal weakness of existing methods in this area is that they assess the similarity of text styles without regard to their surrounding context. This paper proposes a novel mathematical model of the writing process that aims to quantify this dependency. A text is divided into a series of sequential sub-documents, each represented by a term histogram. The proximity of two histograms is estimated with a simple probability distance. To characterize the writing style, a new feature is introduced: the mean distance between the current sub-document and a number of preceding ones. The empirical distribution of this feature over the whole document specifies the writing style. Dissimilarity of such distributions therefore indicates a difference in writing styles, and their coincidence implies that the styles are identical. Numerical experiments demonstrate the high potential of the proposed approach.
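The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the whitespace tokenization, the fixed chunk size, the use of total variation as the "simple probability distance", the lookback parameter, and the Kolmogorov-Smirnov statistic for comparing the two empirical distributions are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Split a text into sequential sub-documents of chunk_size words each."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    # A trailing fragment shorter than chunk_size is dropped.
    return [words[i:i + chunk_size]
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]


def histogram(words):
    """Normalized term histogram (relative term frequencies) of one chunk."""
    counts = Counter(words)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def total_variation(p, q):
    """Assumed probability distance: half the L1 distance between histograms."""
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


def mean_distance_to_precursors(text, chunk_size=50, lookback=3):
    """For each chunk, the mean distance to the `lookback` chunks before it."""
    hists = [histogram(c) for c in split_into_chunks(text, chunk_size)]
    feature = []
    for i in range(lookback, len(hists)):
        previous = hists[i - lookback:i]
        feature.append(
            sum(total_variation(hists[i], h) for h in previous) / lookback)
    return feature


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical distribution functions of two non-empty samples."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in list(a) + list(b))
```

Under these assumptions, two documents would be compared by calling `mean_distance_to_precursors` on each and passing the two resulting lists to `ks_statistic`; a value near 0 suggests similar feature distributions, a value near 1 suggests different ones. Both texts must be long enough to yield more than `lookback` chunks, otherwise the feature list is empty and the comparison is undefined.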
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Volkovich, Z. (2016). A Time Series Model of the Writing Process. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_10
DOI: https://doi.org/10.1007/978-3-319-41920-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer Science (R0)