Abstract
A key obstacle to building a Thai plagiarism corpus is the lack of real examples of plagiarized text. To address this, we present the design and construction of a new Thai plagiarism corpus, called TPLAC-2019, for evaluating plagiarism detection algorithms for Thai. The corpus is created using two methods: 1) simulated plagiarism and 2) artificial plagiarism. For the simulated plagiarism method, we provide a Thai plagiarism tagging tool called PlaTool and a Thai plagiarism guideline to help human annotators plagiarize text passages. For the artificial plagiarism method, plagiarized documents are generated automatically by a machine. Within this method, we also propose a new technique for automatically creating plagiarized text passages that resemble human language. To evaluate the machine-generated Thai plagiarized passages, we prepared test sets generated by a baseline method and by the proposed method. The experiments compare the readability of the plagiarized documents produced by the two methods. The experimental results show that the proposed method improves the readability of the generated text by up to 40%.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Thaiprayoon, S. et al. (2020). Developing a Framework for a Thai Plagiarism Corpus. In: Nguyen, LM., Phan, XH., Hasida, K., Tojo, S. (eds) Computational Linguistics. PACLING 2019. Communications in Computer and Information Science, vol 1215. Springer, Singapore. https://doi.org/10.1007/978-981-15-6168-9_42
DOI: https://doi.org/10.1007/978-981-15-6168-9_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6167-2
Online ISBN: 978-981-15-6168-9
eBook Packages: Computer Science, Computer Science (R0)