Abstract
Vandalism has always been one of Wikipedia’s greatest problems, yet only a few fully automatic solutions have been developed so far. Volunteers still spend large amounts of time correcting vandalized page edits instead of using this time to improve the quality of article content. The purpose of this paper is to introduce a new vandalism detection system that uses only natural language processing and machine learning techniques. The system has been evaluated on a corpus of real vandalized data in order to test its performance and justify the design choices. The same expert-annotated wikitext, extracted from the encyclopedia’s database, is used to evaluate different vandalism detection algorithms. The paper presents a critical analysis of the obtained results, comparing them to existing solutions, and suggests different statistical classification methods that bring several improvements to the task at hand.
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cioiu, D., Rebedea, T. (2013). WikiDetect: Automatic Vandalism Detection for Wikipedia Using Linguistic Features. In: Bǎdicǎ, C., Nguyen, N.T., Brezovan, M. (eds) Computational Collective Intelligence. Technologies and Applications. ICCCI 2013. Lecture Notes in Computer Science, vol 8083. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40495-5_32
DOI: https://doi.org/10.1007/978-3-642-40495-5_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40494-8
Online ISBN: 978-3-642-40495-5
eBook Packages: Computer Science, Computer Science (R0)