Abstract
Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world in which a reference collection is given. This article investigates whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta-learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
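To make the first three stages of such an analysis chain concrete, the following is a minimal sketch and not the authors' implementation. It assumes fixed-size word chunks, a two-feature style model (average word length and function-word ratio), and a simple z-score rule as a stand-in for one-class classification; the meta-learning stage is omitted. All names (`FUNCTION_WORDS`, `chunk`, `style_vector`, `flag_outliers`) and the threshold are hypothetical choices for illustration.

```python
import re
from statistics import mean, pstdev
from typing import List

# Hypothetical, tiny function-word list; a real style model would use far more features.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "for"}


def chunk(text: str, size: int) -> List[List[str]]:
    """Split a text into consecutive chunks of `size` words (the last may be shorter)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words: List[str]) -> List[float]:
    """Toy style model: average word length and share of function words."""
    avg_len = mean(len(w) for w in words)
    fw_ratio = sum(w in FUNCTION_WORDS for w in words) / len(words)
    return [avg_len, fw_ratio]


def flag_outliers(vectors: List[List[float]], threshold: float = 1.5) -> List[bool]:
    """Flag a chunk if any feature deviates from the document mean by more
    than `threshold` standard deviations (assumed stand-in for a one-class
    classifier trained on the document's own style)."""
    if not vectors:
        return []
    flags = [False] * len(vectors)
    for k in range(len(vectors[0])):
        column = [v[k] for v in vectors]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # feature is constant across chunks, so it cannot discriminate
        for i, x in enumerate(column):
            if abs(x - mu) / sigma > threshold:
                flags[i] = True
    return flags


def suspicious_chunks(text: str, size: int = 4) -> List[bool]:
    """Run chunking, style model computation, and outlier flagging in sequence."""
    return flag_outliers([style_vector(c) for c in chunk(text, size)])
```

Calling `suspicious_chunks(text)` returns one boolean per chunk. In this sketch the document itself supplies the "target" style, which mirrors the setting where no reference collection is available.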
Notes
The reduction \(\le_{{tt}}^{p}\) is in \(O(|d|^{2})\); within this time all possible outliers can be constructed for a document d. The reduction \(\le_{{tt}}^{p}\) computes the answer to AVfind from the m answers to AVoutlier by means of a truth table tt, which is a disjunction here.
Function words and stop words are not disjoint sets: most function words are in fact stop words; however, the converse does not hold.
The corpus can be downloaded at http://www.webis.de/research/corpora.
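The truth-table reduction described in the first note can be sketched as follows. This is an illustrative reading, not the article's formal construction: it assumes that a document is a sequence of units (e.g., sentences), that every contiguous section is a candidate outlier, and that an oracle for AVoutlier is supplied by the caller. The names `candidate_sections`, `av_find`, and `av_outlier` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) over document units


def candidate_sections(n_units: int) -> List[Span]:
    """Enumerate all contiguous sections of a document with n_units units.

    There are n(n+1)/2 such spans, so the enumeration is quadratic in the
    document length, in line with the O(|d|^2) bound stated in the note.
    """
    return [(i, j) for i in range(n_units) for j in range(i + 1, n_units + 1)]


def av_find(units: Sequence[str],
            av_outlier: Callable[[Sequence[str], Span], bool]) -> bool:
    """Answer AVfind as the disjunction of the answers to AVoutlier.

    `av_outlier` is an assumed oracle that decides whether the given span
    is a stylistic outlier within `units`.
    """
    return any(av_outlier(units, span) for span in candidate_sections(len(units)))
```

With m candidate spans, `av_find` issues at most m oracle queries and combines them with a logical OR, which is the disjunctive truth table mentioned in the note.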
References
Argamon, S., Šarić, M., & Stein, S. S. (2003). Style mining of electronic messages for multiple authorship discrimination: First results. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 475–480). New York, NY, USA: ACM. ISBN 1-58113-737-0. doi:10.1145/956750.956805.
Bernstein, Y., & Zobel, J. (2004). A scalable system for identifying co-derivative documents. In A. Apostolico & M. Melucci (Eds.), Proceedings of the string processing and information retrieval symposium (SPIRE) (pp. 55–67). Padova, Italy: Springer. Published as LNCS 3246.
Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95 (pp. 398–409). New York, NY, USA: ACM Press. ISBN 0-89791-731-6.
Broder, A. Z., Eiron, N., Fontoura, M., Herscovici, M., Lempel, R., McPherson, J., et al. (2006). Indexing shared content in information retrieval systems. In EDBT ’06 (pp. 313–330).
Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale–Chall readability formula. Cambridge, MA: Brookline Books.
Chaski, C. E. (2005). Who’s at the keyboard? authorship attribution in digital evidence investigations. IJDE, 4(1), 1–14.
Chawla, N. V., Bowyer, K. W., & Kegelmeyer, P. W. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the first conference on North American chapter of the association for computational linguistics (pp. 26–33). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 11–20.
Finkel, R. A., Zaslavsky, A., Monostori, K., & Schmidt, H. (2002). Signature extraction for overlap detection in documents. In Proceedings of the 25th Australian conference on Computer science (pp. 59–64). Australian Computer Society, Inc. ISBN 0-909925-82-8.
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32, 221–233.
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of the 25th VLDB conference Edinburgh, Scotland (pp. 518–529).
Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting a document by stylistic character. Natural Language Engineering, 11(4), 397–415. Supersedes August 2003 workshop version.
Gunning, R. (1952). The technique of clear writing. New York: McGraw-Hill.
Henzinger, M. (2006). Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 284–291). New York, NY, USA: ACM Press. ISBN 1-59593-369-7. doi:10.1145/1148170.1148222.
Hilton, M. L., & Holmes, D. I. (1993). An assessment of cumulative sum charts for authorship attribution. Literary and Linguistic Computing, 8(2), 73–80.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.
Hoad, T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215.
Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111–117. doi:10.1093/llc/13.3.111.
Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2), 172–177.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbor—Towards removing the curse of dimensionality. In Proceedings of the 30th symposium on theory of computing (pp. 604–613).
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334. ISSN 1554-0669. doi:10.1561/1500000005.
Kacmarcik, G., & Gamon, M. (2006). Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL on main conference poster sessions (pp. 444–451). Morristown, NJ, USA: Association for Computational Linguistics.
Kincaid, J., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Research branch report 8–75. Millington TN: Naval Technical Training US Naval Air Station.
Kjell, B., Woods Addison, W., & Frieder, O. (1994). Discrimination of authorship using visualization. Information Processing and Management, 30(1), 141–150. ISSN 0306-4573. doi:10.1016/0306-4573(94)90029-9.
Kleinberg, J. (1997). Two algorithms for nearest-neighbor search in high dimensions. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on theory of computing.
Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of IJCAI’03 workshop on computational approaches to style analysis and synthesis. Mexico: Acapulco.
Koppel, M., & Schler, J. (2004a). Authorship verification as a one-class classification problem. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning (p. 62). New York, NY, USA: ACM. ISBN 1-58113-828-5. doi:10.1145/1015330.1015448.
Koppel, M., & Schler, J. (2004b). Authorship verification as a one-class classification problem. In Proceedings of the 21st international conference on machine learning. Banff, Canada: ACM Press.
Koppel, M., Schler, J., Argamon, S., & Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 659–660). New York, NY, USA: ACM. ISBN 1-59593-369-7. doi:10.1145/1148170.1148304.
Koppel, M., Schler, J., & Bonchek-Dokow, E. (2007). Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8, 1261–1276. ISSN 1533-7928.
Koppel, M., Schler, J., & Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9–26.
Malyutov, M. B. (2006). Authorship attribution of texts: A review. Lecture Notes in Computer Science, 2063, 362–380.
Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.
Mansfield, J. S. (2004). Textbook plagiarism in psy101 general psychology: incidence and prevention. In Proceedings of the 18th annual conference on undergraduate teaching of psychology: Ideas and innovations. New York, USA: SUNY Farmingdale.
Meyer zu Eissen, S., & Stein, B. (2004). Genre classification of web pages: User study and feasibility analysis. In S. Biundo, T. Frühwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence, vol. 3228 LNAI of Lecture Notes in artificial intelligence (pp. 256–269). Berlin Heidelberg New York: Springer. ISBN 0302-9743.
Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic plagiarism detection. In M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Proceedings of the European conference on information retrieval (ECIR 2006), vol. 3936 of Lecture Notes in Computer Science (pp. 565–569). New York: Springer. ISBN 3-540-33347-9.
Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. J. Lenz (Eds.), Advances in data analysis (pp. 359–366). New York: Springer. ISBN 978-3-540-70980-0.
Morton, A. Q., & Michaelson, S. (1990). The qsum plot. Technical report, University of Edinburgh.
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: Federalist papers. Reading, MA: Addison-Wesley Educational Publishers Inc, 1964. ISBN 0201048655.
Pavelec, D., Oliveira, L. S., Justino, E. J. R., & Batista, L. V. (2008). Using conjunctions and adverbs for author verification. Journal of UCS, 14(18), 2967–2981.
Potthast, M., Eiselt, A., Stein, B., Barrón-Cedeño, A., & Rosso, P. (Eds.). (2009). Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia. PAN Plagiarism Corpus 2009 (PAN-PC-09). http://www.webis.de/research/corpora.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199. ISSN 0162-8828. doi:10.1109/TPAMI.2002.1033211.
Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. Ph.D. thesis, University of Pennsylvania.
Rudman, J. (1997). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351–365.
Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice-Hall.
Sanderson, C., & Guenter, S. (2006a). On authorship attribution via markov chains and sequence kernels. In Pattern recognition, 2006. ICPR 2006. 18th international conference on (vol. 3, pp. 437–440). doi:10.1109/ICPR.2006.899.
Sanderson, C., & Guenter, S. (2006b). Short text authorship attribution via sequence kernels, markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 482–491). URL http://acl.ldc.upenn.edu/W/W06/W06-1657.pdf.
Stamatatos, E. (2007). Author identification using imbalanced and limited training texts. In A. M. Tjoa & R. R. Wagner (Eds.), 18th international conference on database and expert systems applications (DEXA 07) (pp. 237–241). IEEE, September 2007. ISBN 0-7695-2932-1. doi: 10.1109/DEXA.2007.37.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of American Society for Information Science & Technology, 60(3), 538–556. ISSN 1532-2882. doi:10.1002/asi.v60:3.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35, 193–214.
Stefik, M. (1995). Introduction to knowledge systems. San Mateo, CA, USA: Morgan Kaufmann.
Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In K. Tochtermann & H. Maurer (Eds.), Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science (pp. 572–579). Know-Center.
Stein, B. (2007). Principles of hash-based text retrieval. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th annual international ACM SIGIR conference (pp. 527–534). ACM, July 2007. ISBN 978-1-59593-597-7.
Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In B. Stein, M. Koppel, & E. Stamatatos (Eds.), SIGIR workshop on plagiarism analysis, authorship identification, and near-duplicate detection (PAN 07) (pp. 45–50). CEUR-WS.org, July 2007. URL http://ceur-ws.org/Vol-276.
Stein, B., & Meyer zu Eissen, S. (2007). Topic-identifikation: Formalisierung, analyse und neue Verfahren. KI—Künstliche Intelligenz, 3, 16–22. ISSN 0933-1875. URL http://www.kuenstliche-intelligenz.de/index.php?id=7758.
Stein, B., Lipka, N., & Meyer zu Eissen, S. (2008). Meta analysis within authorship verification. In A. M. Tjoa & R. R. Wagner (Eds.), 19th international conference on database and expert systems applications (DEXA 08) (pp. 34–39). IEEE, September 2008. ISBN 978-0-7695-3299-8. doi:10.1109/DEXA.2008.20.
Surdulescu, R. (2004). Verifying authorship. Final project report CS391L, University of Texas at Austin.
Tax, D. M. J. (2001). One-class classification. Ph.D. thesis, Technische Universiteit Delft.
Tax, D. M. J., & Duin, R. P. W. (2001). Combining one-class classifiers. In Proceedings of the second international workshop on multiple classifier systems (pp. 299–308). New York: Springer. ISBN 3-540-42284-6.
Tweedie, F. J., & Baayen, H. R. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352. doi:10.1023/A:1001749303137.
van Halteren, H. (2004). Linguistic profiling for author recognition and verification. In ACL ’04: Proceedings of the 42nd annual meeting on association for computational linguistics (p. 199). Morristown, NJ, USA: Association for Computational Linguistics. doi:10.3115/1218955.1218981.
van Halteren, H. (2007). Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing, 4(1), 1. ISSN 1550-4875. doi:10.1145/1187415.1187416.
Yang, H., & Callan, J. P. (2006). Near-duplicate detection by instance-level constrained clustering. In E. N. Efthimiadis, S. Dumais, D. Hawking, & K. Järvelin (Eds.), SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 421–428). ISBN 1-59593-369-7.
Yule, G. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393. doi:10.1002/asi.20316.
Zipf, G. K. (1932). Selective studies and the principle of relative frequency in language.
Cite this article
Stein, B., Lipka, N. & Prettenhofer, P. Intrinsic plagiarism analysis. Lang Resources & Evaluation 45, 63–82 (2011). https://doi.org/10.1007/s10579-010-9115-y
DOI: https://doi.org/10.1007/s10579-010-9115-y