research-article

A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

Published: 16 May 2016

Abstract

This article introduces CALAM (Cursive And Language Adaptive Methodologies), a large corpus of handwritten text document images for Urdu script. The database contains unconstrained handwritten sentences together with structural annotations of the offline handwritten text images, represented in XML. Urdu is the fourth most frequently used language in the world, but because of its complex cursive script and scarce resources, it remains a priority research area for document image analysis. Here, a unified approach is applied to the development of an Urdu corpus by collecting printed texts, handwritten texts, and demographic information about writers on a single form. CALAM contains 1,200 handwritten text images, 3,043 lines, 46,664 words, and 101,181 ligatures. To capture maximum variance among words and handwriting styles, data collection is distributed across six categories and 14 subcategories. Handwritten forms were filled out by 725 writers of different geographical regions, ages, genders, and educational backgrounds. An XML-based structure annotates handwritten Urdu script images at the line, word, and ligature levels, providing ground truth for each image at each level of annotation. The corpus is intended as a benchmark and testbed for linguistic research and for evaluating handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed versus handwritten text, and categorization of texts by use. The article also reports experimental results for several recently developed handwritten text line segmentation techniques applied to the proposed dataset, to demonstrate its viability and usability.
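
The abstract describes ground truth stored in XML at the line, word, and ligature levels but does not show the schema. The following is a minimal sketch of how such a nested annotation might be read, assuming a hypothetical layout: the tag names (`image`, `line`, `word`, `ligature`), the `bbox` attribute, and the helper functions are illustrative assumptions, not the CALAM format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one handwritten image. Tag and attribute
# names are illustrative assumptions, not the actual CALAM schema.
SAMPLE_XML = """
<image id="img_0001" writer="w_001">
  <line id="l1" bbox="10,20,600,80">
    <word id="l1_w1" bbox="420,20,600,80">
      <ligature id="l1_w1_g1" bbox="520,20,600,80"/>
      <ligature id="l1_w1_g2" bbox="420,20,515,80"/>
    </word>
    <word id="l1_w2" bbox="10,20,410,80">
      <ligature id="l1_w2_g1" bbox="10,20,410,80"/>
    </word>
  </line>
</image>
"""


def parse_bbox(text):
    """Convert 'x1,y1,x2,y2' into a tuple of four integers."""
    x1, y1, x2, y2 = (int(v) for v in text.split(","))
    return x1, y1, x2, y2


def count_levels(root):
    """Count annotated elements at each level below the image."""
    return {
        "lines": len(root.findall("line")),
        "words": len(root.findall("line/word")),
        "ligatures": len(root.findall("line/word/ligature")),
    }


def contained(inner, outer):
    """Return True if box `inner` lies entirely within box `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def check_nesting(root):
    """Return True if every child box lies inside its parent's box."""
    for parent in root.iter():
        if "bbox" not in parent.attrib:
            continue  # the image element carries no box in this sketch
        parent_box = parse_bbox(parent.get("bbox"))
        for child in parent:
            if not contained(parse_bbox(child.get("bbox")), parent_box):
                return False
    return True


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XML)
    print(count_levels(root))
    print(check_nesting(root))
```

Nesting ligatures inside words and words inside lines lets one file serve segmentation experiments at any of the three levels, and a containment check like `check_nesting` is one simple way to validate ground truth of this shape.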

      Information & Contributors

      Information

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4
      June 2016
      173 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/2915955
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 May 2016
      Accepted: 01 December 2015
      Revised: 01 December 2015
      Received: 01 January 2015
      Published in TALLIP Volume 15, Issue 4

      Author Tags

      1. OCR algorithms benchmarking
      2. Urdu handwritten text
      3. annotation
      4. corpus

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Cited By

      • (2024) Efficient CRNN: Towards end-to-end low resource Urdu text recognition using depthwise separable convolutions and gated recurrent units. Information Processing & Management 61:1 (103544). DOI: 10.1016/j.ipm.2023.103544. Online publication date: Jan-2024
      • (2023) Analysis of Cursive Text Recognition Systems: A Systematic Literature Review. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7 (1-30). DOI: 10.1145/3592600. Online publication date: 20-Jul-2023
      • (2023) UTRNet: High-Resolution Urdu Text Recognition in Printed Documents. Document Analysis and Recognition - ICDAR 2023 (305-324). DOI: 10.1007/978-3-031-41734-4_19. Online publication date: 19-Aug-2023
      • (2022) Word Level Script Identification Using Convolutional Neural Network Enhancement for Scenic Images. ACM Transactions on Asian and Low-Resource Language Information Processing 21:4 (1-29). DOI: 10.1145/3506699. Online publication date: 4-Mar-2022
      • (2022) UOHTD: Urdu Offline Handwritten Text Dataset. Frontiers in Handwriting Recognition (498-511). DOI: 10.1007/978-3-031-21648-0_34. Online publication date: 4-Dec-2022
      • (2021) Offline Pashto Characters Dataset for OCR Systems. Security and Communication Networks 2021. DOI: 10.1155/2021/3543816. Online publication date: 1-Jan-2021
      • (2021) UrduAI: Writeprints for Urdu Authorship Identification. ACM Transactions on Asian and Low-Resource Language Information Processing 21:2 (1-18). DOI: 10.1145/3476467. Online publication date: 31-Oct-2021
      • (2021) MMU-OCR-21: Towards End-to-End Urdu Text Recognition Using Deep Learning. IEEE Access 9 (124945-124962). DOI: 10.1109/ACCESS.2021.3110787. Online publication date: 2021
      • (2019) POS Tagging and Structural Annotation of Handwritten Text Image Corpus of Devnagari Script. Emerging Technologies in Computer Engineering: Microservices in Big Data Analytics (286-297). DOI: 10.1007/978-981-13-8300-7_24. Online publication date: 18-May-2019
      • (2018) Artificial Urdu Text Detection and Localization from Individual Video Frames. Mehran University Research Journal of Engineering and Technology 37:2 (429-438). DOI: 10.22581/muet1982.1802.18. Online publication date: 1-Apr-2018
