Abstract
Handwritten character recognition (HCR) is a classical challenge in pattern recognition. Benchmark datasets for HCR in the Gujarati language are particularly limited, so a proper dataset is required for experimentation. This work therefore introduces dataset generation for the Gujarati language using pre-processing and classification techniques. The handwritten data are first collected from various native Gujarati writers. Three processes are then carried out to generate the dataset. First, the pre-processing stages are performed: image selection, noise removal, normalization, conversion of integer values to double, conversion of the grayscale image to a binary image, dimensionality reduction, and vector conversion. Next, the pre-processed image is segmented using line segmentation, character segmentation, and word segmentation. Finally, the data are classified using a convolutional neural network (CNN). The kappa and false positive rate (FPR) values achieved by the CNN are 0.981 and 0.189, respectively.
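A few of the pre-processing stages named in the abstract (integer-to-double conversion, grayscale-to-binary conversion, and vector conversion), together with line segmentation, can be illustrated with a minimal sketch. The abstract does not specify how these stages are implemented, so everything here is an assumption made for illustration: the function names `preprocess` and `segment_lines`, the fixed threshold of 0.5, and the use of a horizontal projection profile to find text lines are hypothetical choices, not the authors' method.

```python
import numpy as np


def preprocess(gray_u8, threshold=0.5):
    """Normalize, binarize, and flatten one 8-bit grayscale image.

    The fixed threshold is an assumed value, not one taken from the paper.
    """
    # Integer-to-double conversion: uint8 in [0, 255] -> float64 in [0.0, 1.0]
    img = gray_u8.astype(np.float64) / 255.0
    # Grayscale-to-binary conversion: dark ink becomes 1, light paper becomes 0
    binary = (img < threshold).astype(np.uint8)
    # Vector conversion: flatten the 2-D image into a 1-D feature vector
    return binary.ravel()


def segment_lines(binary_page):
    """Return (start_row, end_row) ranges of text lines in a binary page.

    Uses a horizontal projection profile (an assumed technique): a row
    belongs to a line if it contains at least one ink pixel. The end row
    is exclusive.
    """
    has_ink = binary_page.sum(axis=1) > 0
    lines, start = [], None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row                 # first inked row opens a line
        elif not ink and start is not None:
            lines.append((start, row))  # first blank row closes it
            start = None
    if start is not None:               # line runs to the bottom of the page
        lines.append((start, len(has_ink)))
    return lines
```

For a page stored as a 2-D `uint8` array, `preprocess(page).reshape(page.shape)` gives the binary page that `segment_lines` expects. The same profile idea could be applied along columns within each line to separate words and characters.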
Data Availability
Data sharing is not applicable to this article.
Funding
No funding was provided for the preparation of this manuscript.
Author information
Contributions
All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to Participate
All authors have agreed to participate in this submitted article.
Consent to Publish
All authors give full consent for the publication of this submitted article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Suthar, S.B., Thakkar, A.R. Dataset Generation for Gujarati Language Using Handwritten Character Images. Wireless Pers Commun 136, 2163–2184 (2024). https://doi.org/10.1007/s11277-024-11369-9
DOI: https://doi.org/10.1007/s11277-024-11369-9