Abstract
This study investigates the use of Natural Language Processing (NLP) techniques to analyze coding standards violations in an introductory programming course. In particular, it evaluates how well several text embedding techniques, including Bag of Words (BOW), Doc2Vec, and BERT, support the clustering of coding standards violations, with the aim of determining which embeddings yield the most accurate clustering of errors. Our findings indicate that Doc2Vec embeddings cluster related errors more effectively than the alternative techniques.
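As a rough illustration of the simplest representation named above, the following sketch builds bag-of-words count vectors for a few violation messages and compares them with cosine similarity. It is a minimal sketch under stated assumptions, not the study's pipeline: the messages are hypothetical examples written for illustration, the helper names `bag_of_words` and `cosine_similarity` are our own, and the abstract does not specify the tokenization, similarity measure, or clustering setup actually used.

```python
import math
import re
from collections import Counter


def bag_of_words(message):
    """Lowercase the message and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", message.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical violation messages (assumed for illustration only,
# not taken from the study's data).
messages = [
    "Missing a Javadoc comment.",
    "First sentence of Javadoc is missing an ending period.",
    "Line is longer than 80 characters.",
]

vectors = [bag_of_words(m) for m in messages]

# Two documentation-related messages share tokens, so they score higher
# than a documentation message paired with a line-length message.
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

A clustering step would then group messages whose vectors lie close together. Bag of words only captures shared vocabulary, which is one reason to compare it against learned embeddings such as Doc2Vec and BERT.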
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ceh-Varela, E., Imhmed, E. (2025). Investigating Freshmen Students’ Coding Standards Challenges Using NLP Techniques. In: Weber, GW., Martinez Trinidad, J.F., Sheng, M., Ramachand, R., Kharb, L., Chahal, D. (eds) Information, Communication and Computing Technology. ICICCT 2024. Communications in Computer and Information Science, vol 2131. Springer, Cham. https://doi.org/10.1007/978-3-031-72483-1_1
DOI: https://doi.org/10.1007/978-3-031-72483-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72482-4
Online ISBN: 978-3-031-72483-1
eBook Packages: Computer Science, Computer Science (R0)