Abstract
The rapid growth of the Internet and the increasing availability of electronic documents pose problems such as identifying the author of an anonymous text and detecting plagiarism. This study aims to determine the author of a given document from a set of text documents whose authors are known. Although a large body of research on author identification was conducted for English during the last century, Turkish and other languages have attracted interest only in the last decade. This study therefore addresses the author identification problem using two Turkish datasets collected from two different Turkish newspapers. Together the datasets comprise 850 columns written by 17 columnists, 50 columns from each columnist. Four machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbor and Decision Tree) were employed, and 99.7% accuracy was achieved with the K-Nearest Neighbor algorithm. With the Chi-square feature selection method, which reduced the number of features from 20 to 17, the authors were fully recognized.
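As a rough illustration of the Chi-square feature selection step mentioned in the abstract, the sketch below scores each feature by its Pearson chi-square statistic against the author labels and keeps the highest-scoring ones. It is a minimal sketch under stated assumptions, not the authors' implementation: the feature names, the toy data, the discretization into categorical values, and the helpers `chi_square` and `select_top_k` are all hypothetical.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts per (value, author)
    feature_totals = Counter(feature_values)         # marginal counts per feature value
    label_totals = Counter(labels)                   # marginal counts per author
    score = 0.0
    for value, value_count in feature_totals.items():
        for author, author_count in label_totals.items():
            # Expected count if the feature were independent of the author.
            expected = value_count * author_count / n
            score += (observed[(value, author)] - expected) ** 2 / expected
    return score


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: six documents by two authors, three discretized features.
labels = ["A", "A", "A", "B", "B", "B"]
columns = {
    "avg_sentence_length": ["long", "long", "long", "short", "short", "short"],
    "comma_rate": ["high", "low", "high", "low", "high", "low"],
    "digit_rate": ["low", "low", "low", "low", "low", "low"],
}

selected = select_top_k(columns, labels, k=2)
print(selected)
```

In this toy example the constant feature `digit_rate` scores zero and is dropped, which mirrors the idea of discarding features that carry no information about the author. The abstract does not say how the 20 features were defined or discretized, so the real pipeline may differ.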
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Bay, Y., Çelebi, E. (2016). Feature Selection for Enhanced Author Identification of Turkish Text. In: Abdelrahman, O., Gelenbe, E., Gorbil, G., Lent, R. (eds) Information Sciences and Systems 2015. Lecture Notes in Electrical Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-319-22635-4_34
DOI: https://doi.org/10.1007/978-3-319-22635-4_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22634-7
Online ISBN: 978-3-319-22635-4
eBook Packages: Engineering, Engineering (R0)