
Feature Selection for Enhanced Author Identification of Turkish Text

  • Conference paper
Information Sciences and Systems 2015

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 363)

Abstract

The rapid growth of the Internet and the increasing availability of electronic documents pose problems such as identifying the author of an anonymous text and detecting plagiarism. This study aims to determine the author of a given document from a set of text documents whose authors are known. Although a large body of research on author identification has been conducted for English over the last century, Turkish and other languages have attracted interest only in the last decade. This study therefore addresses the author identification problem using two Turkish datasets collected from two different Turkish newspapers. Together, the datasets comprise 850 columns written by 17 columnists, 50 columns from each columnist. Four machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbor and Decision Tree) were employed, and 99.7 % accuracy was achieved with the K-Nearest Neighbor algorithm. With the Chi-square feature selection method, which reduced the number of features from 20 to 17, classification was fully correct.
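
The combination named in the abstract, Chi-square feature scoring followed by K-Nearest Neighbor classification, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy feature matrix, the function names (`chi2_scores`, `select_top_k`, `project`, `knn_predict`), the Euclidean distance and the choice of k = 3 are all assumptions made for illustration, and the score below treats each non-negative feature value as a count summed per author.

```python
import math
from collections import Counter

def chi2_scores(X, y):
    """Chi-square score per feature, assuming non-negative feature values.

    Observed = per-class sum of the feature;
    expected = class prior * total of the feature over all documents.
    """
    n = len(X)
    classes = sorted(set(y))
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = prior[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_top_k(scores, k):
    """Indices of the k highest-scoring features, in original column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def project(X, keep):
    """Keep only the selected feature columns."""
    return [[row[j] for j in keep] for row in X]

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 6 documents, 4 features, 2 authors.
# Features 0 and 1 differ between authors; features 2 and 3 do not.
X = [
    [9, 1, 5, 5],
    [8, 2, 5, 5],
    [9, 1, 5, 5],
    [1, 9, 5, 5],
    [2, 8, 5, 5],
    [1, 9, 5, 5],
]
y = ["A", "A", "A", "B", "B", "B"]

keep = select_top_k(chi2_scores(X, y), k=2)       # drop the 2 weakest features
reduced = project(X, keep)
pred = knn_predict(reduced, y, query=[8, 1], k=3)  # query uses the kept columns
print(keep, pred)
```

In the study the same idea is applied at a different scale: 20 features are scored and the 17 strongest are kept before classification. The sketch only shows the mechanics on four made-up features.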



Author information

Corresponding author

Correspondence to Yasemin Bay.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bay, Y., Çelebi, E. (2016). Feature Selection for Enhanced Author Identification of Turkish Text. In: Abdelrahman, O., Gelenbe, E., Gorbil, G., Lent, R. (eds) Information Sciences and Systems 2015. Lecture Notes in Electrical Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-319-22635-4_34

  • DOI: https://doi.org/10.1007/978-3-319-22635-4_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22634-7

  • Online ISBN: 978-3-319-22635-4

  • eBook Packages: Engineering, Engineering (R0)
